EDITORIAL


To solve climate, first achieve peace 


The horrific invasion of Ukraine by Russia has
many devastating effects. The most immediate 
are on the people of Ukraine, but the long-term 
implications for the entire planet are enormous. 
For science, the disruption to international col- 
laboration must be addressed and we must give 
our strongest support to Ukrainian scientists, as 
outlined by Marcia McNutt and John Hildebrand in a re- 
cent Science editorial. But for climate change, the effects 
may be the greatest. If we want a positive energy future 
for a healthier climate, the West must start by recasting 
foreign policy with climate and energy issues at the fore- 
front. That can only succeed if nations strengthen the 
commitment to settle differences with diplomacy, not 
war. The only truly life-sustaining 
climate will be one accompanied by 
international peace. 

The impact of climate change on 
national and global security is not 
new. But this association was not 
always front and center. The push 
for renewable energy has deep 
roots in geopolitical and security 
concerns rather than in climate. 
But for too long, this geopolitical 
tension has been kept in a separate 
bucket from climate dangers. Last 
week, in addressing the Ukraine 
war, President Biden pledged that 
the United States would help Eu- 
rope become less dependent on 
Russian oil. But nothing was said 
about climate. People should be made aware that the fu- 
ture of the planet is inextricably intertwined with inter- 
national conflict. Many fossil-fuel-dependent countries, 
including the United States, have long been subjected 
to a volatile global oil market that is largely controlled 
by the Middle East. The war in Ukraine has served as a 
grim reminder that Europe has allowed itself to become 
dependent on fossil fuels from Russia—a country that 
is now an enemy. Without Russian oil, which nations 
can Europe turn to for fossil fuels without leaving itself 
the potential victim of future conflicts elsewhere on the 
planet? And that may be a short-term solution while it 
tries to transition to renewable energy. 

Meanwhile, China keeps building coal-fired electric- 
ity plants. Why aren’t countries working with China to 
slow this down? While the United States hurls invec- 
tives toward Beijing about the origins of COVID-19, eco- 
nomic espionage, and intellectual property protection,
it is losing the opportunity to cooperate with China on 
energy sources cleaner than coal. Is a purely adversarial 
foreign policy the best strategy when China produces 
greenhouse gases (more than any other country in the 
world) at its discretion? 

Unfortunately, solar and wind power won’t provide 
enough energy in the foreseeable future. That leaves an 
opening for more nuclear power—but again, the war in 
Ukraine reminds us that it’s not only a matter of building 
more nuclear plants but also of reducing the risks they 
present during times of tension. A nuclear power plant 
is a dangerous military target. The attacks by Russia on 
Ukrainian nuclear facilities are alarming—they have the 
potential to scatter nuclear waste over large areas and 
even provide the ingredients for a 
dirty bomb. Nuclear power cannot 
provide a path to a more equitable 
and healthy energy future in a world 
of geopolitical conflict. Also, if a 
country can build a nuclear power 
plant, it can build a nuclear weapon. 
The scientific case for replacing 
fossil fuels with nuclear power is 
strong, but the political case gets 
much weaker in a hostile world. 

Last year, President Biden said 
that climate change was the great- 
est threat to national security. 
That is correct and a reassuring 
statement from a president when 
so many climate deniers vie for 
power in the United States and 
around the world. But the reverse is also true. A hostile 
world is the greatest threat to a generative policy for 
dealing with climate change. 

It’s time for US and European foreign policy to be 
reframed in the context of addressing climate change. 
People around the world must better understand that 
a healthier world where all forms of energy are appro- 
priately utilized requires a world that focuses first on 
peace. Without working toward peace as the first step, 
international efforts to tackle global climate change 
and promote renewable energy and sustainable devel- 
opment cannot progress. Global governance becomes 
more fragile, and as climate change worsens, it will 
threaten world stability, trigger humanitarian crises, 
and provoke more war, in a deadly cycle. 

To win the climate war, we must win the climate peace. 


-H. Holden Thorp 


H. Holden Thorp 
Editor-in-Chief, 
Science journals. 
hthorp@aaas.org;
@hholdenthorp 
“I’ve often heard folks say, ‘Let’s get back to normal.’
Well, normal was not equitable.”


Higher education researcher Christa Porter, at a U.S. national science academies forum,
on the pandemic’s long-term impact on the careers of women scientists.
PALEONTOLOGY


Stan the T. rex gets new home


The Tyrannosaurus rex fossil known as Stan,
auctioned in October 2020 for a record
$31.8 million, will be housed in a new natural
history museum in Abu Dhabi, United
Arab Emirates, to be completed by 2025, the
city’s Department of Culture and Tourism
announced last week. The 67-million-year-old,
11.7-meter fossil (above) was found in 1999 in South
Dakota and has generated many research papers
because it’s well preserved and almost complete. It
was dug up and owned by the Black Hills Institute
of Geological Research until its auction. The private
buyer was anonymous, and scientists feared they
wouldn’t be allowed to conduct further studies. But
the museum will provide access, officials there told
National Geographic, which first reported the new
museum’s acquisition. Joining Stan will be a sample
from the Murchison meteorite, which landed in
Australia more than 40 years ago and has advanced
scientists’ understanding of the early Solar System.


China seeks more ethics reviews


POLICY | China’s government last week
released guidance for strengthening the
ethical governance of research in life science,
medicine, artificial intelligence, and
other “sensitive” fields. Research institutions
and companies working in these
areas must establish committees to review
and monitor research involving humans
and animals and ensure compliance with
national and international ethical standards.
Many Chinese institutions already
have such boards, but national oversight
was lacking. China’s handling of such
issues has been under scrutiny since
biophysicist He Jiankui produced the
world’s first gene-edited babies, for which
he was sentenced in 2019 to prison. The
guidelines are wide-ranging but rather
vague, says bioethicist Jing-Bao Nie of the
University of Otago, Dunedin. For example,
they call for the public to participate in the
reviews but don’t say how. Details are to
be worked out among the National Science
and Technology Ethics Committee, national
ministries, and local governments.


FDA OKs second booster 


COVID-19 | The U.S. Food and Drug
Administration (FDA) this week autho- 
rized a second booster dose of COVID-19 
vaccines for adults ages 50 and older and 
for the immunocompromised ages 12 and 
older. The agency cited the waning effect of 
earlier doses and data it said revealed no 
new safety concerns. FDA will allow these 
groups, which are more vulnerable to poor 
outcomes from COVID-19 infections, to 
receive Pfizer’s and Moderna’s messenger 
RNA vaccines beginning 4 months after 
their first booster dose. Also this week,
the U.S. Centers for Disease Control and
Prevention issued guidance that members 
of these groups may receive an additional 
booster. FDA did not consult with a com- 
mittee of outside advisers before making 
the decision, which came after Israeli 
scientists last week posted a preprint ana- 
lyzing recent data from more than 563,000 
people ages 60 and older. It found that a 
second booster shot substantially reduced 
COVID-19 mortality. 


Microplastics in blood measured 


POLLUTION | Researchers have measured 
tiny particles known as microplastics
in human blood, a first that could aid
research on the particles’ health risks, 
which are unknown. The microscopic 
particles come from plastic degrading in 
the environment. People are exposed by 
breathing air and consuming food and 
drink laced with the microplastics, which 
have been found worldwide. Previously, 
they had been detected in human pla- 
centas and animal organs. But until now, 
measuring them in blood had proved 
technically difficult because of its complex 
mix of molecules. Heather Leslie, a chemist 
and ecotoxicologist at the Free University 
of Amsterdam, studied samples from
22 volunteers with a mass spectrometer.
Blood from only one volunteer had no 
detectable microplastics, the team reported 
last week in Environment International. 


WHO boosts traditional medicine 


BIOMEDICINE | The World Health 
Organization inked a deal with the Indian 
government last week to launch a global 
center for the study of traditional medi- 
cine, a field that WHO Director-General 
Tedros Adhanom Ghebreyesus predicts 
will be “a game changer for health when 
founded on evidence, innovation, and
sustainability.” Indian Prime Minister 
Narendra Modi, too, has strongly pro- 
moted traditional medicine, but scientists 
have blasted his government’s embrace 
of unproven therapies, including for 
COVID-19. The center, which opens on 25 
April, will be based in Jamnagar, a city
in Gujarat state that is a bastion of the
ayurvedic school of traditional Indian 
medicine. India will invest about $250 
million in the center to pay for land, a new 
building, and operating costs for 10 years. 


Gene therapy gel shows promise 


CLINICAL RESEARCH | A DNA-laden topical
skin gel has helped heal wounds from a
rare inherited disorder in what is proving
to be the first clinical success for this
form of gene therapy. In a small study
published this week in Nature Medicine,
a team used the gel, which contains a
modified herpesvirus carrying a gene for
the protein collagen VII, to treat nine
people with recessive dystrophic epidermolysis
bullosa. Caused by a mutated
version of the collagen gene, the disease
causes their skin to tear easily, resulting
in painful open wounds, persistent infections,
scars, and sometimes skin cancer.
Repeated applications of the gel over
25 days healed most patients’ wounds
within 3 months after treatment; a
wound treated with a placebo kept opening
and closing. Trial sponsor Krystal
Biotech Inc. reported positive results
from a larger, 31-person trial at a meeting
last week and plans to seek regulatory
approval this year.


ECOLOGY


Bleaching hits Great Barrier Reef again


Warm ocean temperatures have again caused mass bleaching of coral that
makes up the famed Great Barrier Reef, for the fourth time since 2016. The
damage is occurring in many locations along its 2000-kilometer length,
Australia’s Great Barrier Reef Marine Park Authority said last week. The
increasing frequency of bleaching “decreases the ability of the coral community
to bounce back,” says ecologist Kathy Townsend of the University of the Sunshine
Coast, Fraser Coast. This year’s damage was unexpected because it has occurred
during a La Niña event, which typically brings cooler weather to the region. A UNESCO
team has been visiting Australia to study whether to recommend that the Great Barrier
Reef, which is a World Heritage Area, be formally classified as “in danger,” a step that
would raise pressure for Australia to take action to limit climate change. A year ago, the
Australian government, concerned about the impact on tourism, successfully lobbied
the World Heritage Committee to hold off making that designation.


A Great Barrier Reef coral turned white last month, after warm ocean temperatures caused it 
to eject symbiotic algae living in its tissues. 
PARTICLE PHYSICS


U.S. pares back neutrino experiment to beat rival 


Decision to build troubled megaproject in two phases will add to ballooning cost 


By Adrian Cho 


Struggling with cost overruns, the Department
of Energy (DOE) has decided
to build the United States’s next
great particle physics experiment in
two phases, officials told physicists
this month. The decision means the
megaproject—actually two intertwined ef- 
forts called the Long-Baseline Neutrino 
Facility (LBNF) and the Deep Underground 
Neutrino Experiment (DUNE)—won’t be 
completed to its original specs until the 
mid-2030s instead of late this decade, and 
that its already ballooning cost will increase 
even further. 

But the adjusted plan should get a 
slimmed version of LBNF/DUNE running 
as early as 2029, researchers say. With that 
timeline, the experiment still has a chance 
to beat a rival in Japan called Hyper- 
Kamiokande (Hyper-K) to a key neutrino 
measurement, says Jim Siegrist, DOE’s asso- 
ciate director for high energy physics, who 
laid the plan out by video on 16 March dur- 
ing a community planning exercise known 
as the Snowmass process. “DOE believes 
there is still a chance for LBNF/DUNE to 
come first,” Siegrist wrote in an email. 

Physicists involved in the project sug- 
gested the change in tack, says Regina 
Rameika, a neutrino physicist at Fermi Na- 
tional Accelerator Laboratory (Fermilab) 
and co-spokesperson for the 1400-member 
DUNE collaboration. “We are all on the 
same page about what comes first and what 
comes second,” she says. But the stretched-out
schedule presents a challenge, says Inés
Gil-Botella, a DUNE physicist at Spain’s
Center for Energy, Environmental, and
Technological Research. “We need to work
harder to reduce as much as possible any
delays between phase one and phase two,”
she says. The revised plan makes clear that
the current $3 billion cost estimate—nearly
twice the original—now applies to the first
phase alone.

1 APRIL 2022 • VOL 376 ISSUE 6588

Nearly massless and rarely interacting 
with other matter, neutrinos come in three 
types—electron, muon, and tau—depending 
on how they’re generated. One type can
morph into another as the particles zip 
along at near-light-speed. To study those 
neutrino oscillations, physicists can fire a 
beam of muon neutrinos generated with 
a particle accelerator to a huge detector 
hundreds of kilometers away, which counts 
the arriving muon neutrinos—plus electron 
neutrinos that have appeared along the way. 

LBNF/DUNE would shoot an intense 
beam of muon neutrinos from Fermilab to a 
detector 1300 kilometers away in the aban- 
doned Homestake gold mine in Lead, South 
Dakota. Filled with frigid liquid argon, the 
novel DUNE detector would capture the 
collisions of neutrinos and atomic nuclei in 
unprecedented detail. Besides putting phys- 
icists’ theory of neutrino oscillations to the 
acid test, it would search for charge-parity 
(CP) violation, an asymmetry between the 
oscillations of neutrinos and antineutrinos 
that could help explain how the universe 
created more matter than antimatter. 

More than 1 kilometer down in a former gold mine in
South Dakota, a worker helps excavate tunnels and
chambers for a massive neutrino detector.

After squabbling for a decade, U.S. physicists
rallied around the experiment in 2013
in the previous Snowmass. A year later,
DOE’s Particle Physics Project Prioritization
Panel (P5) named it the new flagship domestic
effort. In 2015, DOE embraced the P5 vision:
an internationally funded experiment
that would have a liquid argon detector
with a target mass of at least 40,000 tons,
a neutrino beam powered by a multimegawatt
proton beam, and a smaller detector at
Fermilab to monitor the outgoing beam. It
would cost less than $1.9 billion and be completed
by 2028, DOE estimated. But refurbishment
and excavation at the mine have
helped drive up the cost and delay progress
(Science, 24 September 2021, p. 1424).

Now, Siegrist says the project will start 
smaller, with a detector in South Dakota 
containing just two of four planned argon 
modules, with a combined target mass of 
no more than 24,000 tons. The power of the 
proton beam at Fermilab will be limited to 
1.2 megawatts, and the detector at the lab 
will be simpler. After completing the first 
phase, DOE could pursue the bigger detec- 
tor and better beam in an upgrade likely 
costing hundreds of millions of dollars. 

With just that first phase, LBNF/DUNE 
should be capable of measuring CP viola- 
tion. But physicists with Japan’s Hyper-K 
say they’ll start to take data in 2028, a year 
before LBNF/DUNE. Hyper-K will consist 
of a subterranean cylindrical tank 70 meters
tall and filled with 260,000 tons of
ultrapure water. Some 40,000 phototubes 
will capture the distinctive flashes of light 
generated when neutrinos fired from the 
Japan Proton Accelerator Research Com- 
plex in Tokai, 295 kilometers to the east, 
interact with the water. 

Workers in Japan have already begun ex- 
cavation, says Matthew Malek, a Hyper-K 
physicist at the University of Sheffield. Given 
Japan’s relatively conservative approach to 
scheduling, that date is likely to hold, he 
says. “At this stage, there’s really no chance 
that DUNE will turn on before Hyper-K,” he 
says. But Marvin Marshak, a DUNE physicist 
at the University of Minnesota, Twin Cit- 
ies, won’t concede the race ahead of time. 
“One thing I learned over the last 50-some 
years that I’ve been doing this is not to over- 
estimate your competition,” he says. 

Hyper-K and DUNE will also search for 
neutrinos from supernova explosions and 
signs that protons can decay. DUNE’s ulti- 
mate goal is to probe physicists’ model of 
three neutrino types with enough precision 
to tell whether it breaks down at some level, 
which could be proof of new particles and 
forces, Gil-Botella says. For technical reasons, 
Hyper-K will struggle to do so, she says. 

DOE has asked physicists to weigh the 
scientific case for the second phase in the 
new Snowmass process and subsequent P5 
prioritization, potentially giving skeptics 
a chance to question whether the whole 
project is worth the higher cost. Tao Han, a 
theorist at the University of Pittsburgh who 
has been involved in Snowmass organizing, 
predicts U.S. physicists will reiterate their 
support for it. “We will have the community 
support to reach our final goal,” he says. 

Malek wonders whether the new plan is 
DOE’s way of saying the first stage is all it 
will build. “When you build something in 
multiple stages, the odds of getting beyond 
the first stage are more unlikely than likely,” 
he says. Mary Bishai, a DUNE physicist at 
Brookhaven National Laboratory, reads 
things differently: “Jim Siegrist said, ‘Go to
Snowmass and come back with a plan for
the third and fourth detector modules.’ He
didn’t say, ‘We’re done.’”

The issue comes at a time of flux for 
the U.S. program. Fermilab, the United 
States’s only dedicated particle physics lab, 
is searching for a new director after Nigel 
Lockyer announced in September 2021 he 
would be stepping down—3 months before 
DOE failed the lab in its annual perfor- 
mance review, citing in part the troubles 
with LBNF/DUNE. And on 31 March, 
Siegrist, 69, will retire after 10 years leading 
DOE’s high energy physics program. As U.S. 
particle physics enters rough waters, it’s un- 
clear who will take the helm. 
U.S. FUNDING


Biden’s 2023 budget request 
for science aims high—again 


New spending plan repeats many themes 
that Congress failed to support fully this year 


By Science News Staff 


President Joe Biden didn’t forget research
this week when he submitted
to Congress a 2023 budget request
that calls for a 9.5% increase in domestic
discretionary spending. Biden
is asking for a 19% increase at the
National Science Foundation (NSF), a 9.3%
boost for the National Institutes of Health
(NIH), 4.5% more for the Department of
Energy’s (DOE’s) Office of Science, and a
5% hike for NASA’s science missions. Once
again, fighting climate change and boosting
sustainable energy technologies also rank
high among Biden’s research priorities.


Wish list

Congress scaled back many of President Joe Biden’s
priorities in this year’s budget, but he has revived
his bid for major new initiatives in health, technology,
climate, and energy.

AGENCY (IN $ BILLIONS)                       2022 FINAL   2023 REQUEST   % INCREASE
National Institutes of Health                    46.178         50.453         9.3%
ARPA-H                                            1              5             400%
National Science Foundation                       8.838         10.492          19%
Technology directorate                            0.37           0.88          140%
STEM education directorate                        1.006          1.377          37%
NASA science                                      7.614          7.988           5%
DOE Office of Science                             7.475          7.799           4%
ARPA-E                                            0.45           0.7            56%
NIST                                              1.23           1.468          38%
S&T labs                                          0.85           0.975          15%
U.S. Geological Survey                            1.393          1.711          23%
NOAA                                              5.877          6.866          17%
EPA science and technology                        0.729          0.863          18%
USDA Agriculture and Food Research Initiative     0.445          0.564          27%
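
The percentage column in the “Wish list” table follows from simple arithmetic on the two dollar columns. As a minimal sketch (the function name `percent_increase` and the dictionary layout are illustrative choices, not anything from the budget documents), the following Python recomputes the increase for a few rows, using the figures as printed in the table:

```python
def percent_increase(final_2022: float, request_2023: float) -> float:
    """Percent change from the 2022 final figure to the 2023 request."""
    return (request_2023 - final_2022) / final_2022 * 100.0


# Figures in $ billions, copied from the "Wish list" table.
# The selection of rows is arbitrary; any row can be checked the same way.
budget = {
    "National Institutes of Health": (46.178, 50.453),
    "ARPA-H": (1.0, 5.0),
    "National Science Foundation": (8.838, 10.492),
    "ARPA-E": (0.45, 0.7),
    "NOAA": (5.877, 6.866),
}

for agency, (final_2022, request_2023) in budget.items():
    change = percent_increase(final_2022, request_2023)
    print(f"{agency}: {change:.1f}%")
```

Rounded to the precision used in the table, these rows come out at 9.3%, 400%, 19%, 56%, and 17%, matching the printed column.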

But as always when a president submits 
an annual budget, the hard part will be 
getting Congress to go along. That process 
usually runs past the 1 October start of the 
fiscal year, leading to a temporary freeze 
on spending at current levels. This year, 
with midterm elections in November that 
could shift control of one or both cham- 
bers from Democrats to Republicans, a fi- 
nal agreement could easily be delayed until 
next year. 

Even with his party now in control, 
Biden’s first budget blueprint for science 
was seriously downsized when Congress 
passed a final 2022 spending bill last 
month. For example, legislators shrank 
Biden’s proposed budget for a new Ad- 
vanced Research Projects Agency for 
Health (ARPA-H) from $6.5 billion to 
$1 billion, instead giving NIH’s existing in- 
stitutes a boost of 5%. But ARPA-H remains 
a presidential favorite, with Biden seeking 
$5 billion for it in 2023. 

Likewise, the final 2022 spending 
bill whittled his proposed increase for 
NSF from 20% to 4% and eliminated a 
$500 million request for a new technology 
directorate that would ramp up NSF’s ap- 
plied research efforts. But Biden has come 
back with a similar-size request for 2023, 
including a proposal to launch 10 mega- 
centers to boost regional innovation. 

The president’s 2023 budget does have 
some new wrinkles. In a bid for bipartisan 
support, Biden wants to boost military 
spending by 4%. (His 2022 budget request 
only boosted defense by 1.8%, a figure Con- 
gress tripled.) And unlike last year, when 
climbing out of a pandemic-caused reces- 
sion was his top priority, Biden says his 
2023 budget will reduce the federal deficit 
by raising taxes on the superwealthy and 
curbing overall spending. 

Here are some highlights from the 
budgets of four key agencies. 


NIH 

The biomedical research agency’s bud- 
get would rise $4.3 billion in 2023, to 
$50.4 billion. But nearly all the new funding
would go to ARPA-H, which is intended
to fund high-risk, cutting-edge research. 
The budgets for most of NIH’s 27 institutes 
and centers would remain flat compared 
with this year. 

“We're disappointed,” says Eleanor 
Dehoney, vice president of policy and advo- 
cacy for Research!America. A $275 million 
increase to NIH’s base budget, Dehoney 
says, “simply isn’t enough to meet the agen- 
cy’s mission.” 

Although the president’s budget places 
ARPA-H at NIH, many advocates would 
like it to be independent. Last month, 
Congress created it as a standalone agency 
within the Department of Health and Hu- 
man Services but gave HHS Secretary 
Xavier Becerra until this week to decide 
whether to move it to NIH. 

One of the few areas tagged for increases 
is NIH’s Office of Nutrition Research. 
Its budget would roughly double, to 
$195 million. NIH’s health disparities pro- 
gram would rise by 9%, to $4.4 billion. 
Biden has also requested $2 million for a 
Center for Sexual Orientation and Gender 
Identity Research. A recent National Acad- 
emies of Sciences, Engineering, and Medi- 
cine report recommended federal agencies 
do a better job of collecting those data. 

NIH would also receive $12.1 billion over 
5 years as part of an HHS plan to improve 
pandemic preparedness using “manda- 
tory” funds that do not require annual ap- 
proval from Congress. The money would 
fund research on vaccines, treatments, 
and diagnostics to address high-priority 
viruses, as well as biosecurity and clinical 
trial infrastructure. 


NSF 

Biden’s budget request for NSF sends a 
clear message to Congress: He believes in 
the agency. The $10.5 billion he has pro- 
posed for 2023 is consistent with aspira- 
tional spending levels in two pending bills 
intended to help the United States out- 
compete China, which the president has 
urged Congress to pass after reconciling 
their differences. 

His 2023 budget also revives NSF’s plan 
to spend $200 million on 10 regional in- 
novation engines. Twice the size of NSF’s 
current bevy of engineering and science 
centers, the new centers are designed 
not just to advance emerging technolo- 
gies, but also address the workforce and 
economic needs of various regions of the 
country. This year’s spending bill encour- 
ages NSF to start one center, from exist- 
ing funds, but NSF Director Sethuraman 
Panchanathan says achieving the presi- 
dent’s vision will require “tens of millions of 
dollars” invested in each of several centers. 

Biden’s new budget boosts spending for
science, technology, engineering, and math 
education at all levels. Notably, it would 
grow by more than one-third the annual 
size of NSF’s flagship graduate training 
program, increasing the number of new 
graduate research fellowships (GRFs) from 
2000 to 2750. It would also raise annual 
stipends for the 3-year GRFs from $34,000 
to $37,000. 


DOE 

The administration’s focus on climate 
shaped its request for DOE’s Office of Science,
with the biological and environmental
research program receiving the biggest in- 
crease, a 10.9% boost, to $904 million. 

The rest of the 4.5% overall increase, 
to $7.8 billion, would be spread more or 
less evenly across the Office of Science’s 
other five major research programs. But 
Biden has also requested a 56% increase 
for DOE’s Advanced Research Projects 
Agency-Energy (ARPA-E), to $700 million, 
in line with the administration’s emphasis 
on developing clean energy technologies. 
The agency aims to rapidly turn basic re- 
search into prototype technologies that the 
private sector might take over. 

The request says ARPA-E will also ex- 
pand its remit “to invest in climate-related 
innovations necessary to achieve net zero 
climate-inducing emissions by 2050.” The 
Biden administration had wanted to create 
a separate entity, called Advanced Research 
Projects Agency-Climate, to advance that 
goal. But last month Congress directed DOE 
to give the task to ARPA-E. 


NASA 

The biggest increase within the $8 billion 
that Biden has requested for the agency’s 
science directorate is a 17% boost for the 
earth sciences program, to $2.4 billion. 
Within that, the budget for the Plankton, 
Aerosol, Cloud, Ocean Ecosystem (PACE) 
satellite to study marine plankton and 
clouds would grow by 38%, to $113 million, 
in preparation for a launch in May 2024. In 
contrast, planetary science, which jumped 
by 18% this year, would only increase by 
1%, to $3.2 billion. One small casualty is 
planning for a Mars Ice Mapper mission. 
Even so, the Mars Sample Return mission, 
an agency priority, would see a 26% in- 
crease, to $822 million. NASA also hopes 
to nearly double spending next year on 
Dragonfly, a helicopterlike craft intended 
to explore Titan, Saturn’s largest moon. 
The launch of the $850 million craft has 
been delayed a year, to 2027. 


With reporting by Adrian Cho, Jocelyn Kaiser, 
Jeffrey Mervis, and Erik Stokstad. 
Dirty bomb
ingredients go 
missing from 
Chornobyl lab 


Insecure radioactive materials 
are the latest worry as Russia 
continues occupation of 
infamous nuclear reservation 


By Richard Stone 


When the lights went out at Chornobyl’s
nuclear power plant on
9 March, the Russian soldiers
holding Ukrainian workers at gunpoint
became the least of Anatolii
Nosovskyi’s worries. More urgent
was the possibility of a radiation accident at 
the defunct plant. If emergency generators 
ran out of fuel, the ventilators that keep ex- 
plosive hydrogen gas from building up inside 
a spent nuclear fuel repository would quit 
working, says Nosovskyi, director of the In- 
stitute for Safety Problems of Nuclear Power 
Plants (ISPNPP) in Kyiv. So would sensors 
and automated systems to suppress radio- 
active dust inside a concrete “sarcophagus” 
that holds the unsettled remains of Chornob- 
yl’s Unit Four reactor, which melted down in 
the infamous 1986 accident. 

Although power was restored to Chor- 
nobyl on 14 March, Nosovskyi’s worries 
have multiplied. In the chaos of the Rus- 
sian advance, he told Science, looters raided 
a radiation monitoring lab in Chornobyl 
village—apparently making off with radio- 
active isotopes used to calibrate instruments 
and pieces of radioactive waste that could 
be mixed with conventional explosives to 
form a “dirty bomb” that would spread con- 
tamination over a wide area. ISPNPP has a 
separate lab in Chornobyl with even more 
dangerous materials: “powerful sources of 
gamma and neutron radiation” used to test 
devices, Nosovskyi says, as well as intensely 
radioactive samples of material left over from
the Unit Four meltdown. Nosovskyi has lost 
contact with the lab, he says, so “the fate of 
these sources is unknown to us.” 

An arching steel structure protects the smoldering remains of Chornobyl’s Unit Four nuclear reactor, but it was not designed to withstand shelling.

The drama at Chornobyl began on
24 February, the very first day of the invasion.
At 5 a.m., as Russian troops poured
across Ukraine’s border with Belarus—just
15 kilometers from Chornobyl—ISPNPP
managers were ordered to evacuate most 
staff, who monitor the safety of the plant, 
provide technical support for decommis- 
sioning, and develop protocols for man- 
aging radioactive waste in the off-limits 
“exclusion zone” surrounding Chornobyl. 
Within 2 hours, 67 had cleared out; two 
who live in Chornobyl village stayed behind 
to keep an eye on the institute’s lab. “We’ve 
lost contact with these brave people,” says 
ISPNPP senior scientist Maxim Saveliev. 

By 5 p.m., Russian troops had taken con- 
trol of all Chornobyl facilities. A shift super-
visor, Valentin Geiko, negotiated a deal under 
which the plant’s Ukrainian guards would 
disarm and the Russian soldiers would not 
interfere with civilian workers, Nosovskyi 
says. But for nearly a month, the soldiers for- 
bade a shift change—essentially holding the 
workers hostage—and confiscated their cell- 
phones. In a gesture of defiance, the workers 
played the Ukrainian national anthem every 
morning, cranking up the volume, Nosovskyi 
says. Last week, the occupiers finally allowed 
fresh staff to rotate in. But some captive 
workers chose to remain, he adds, “so as not 
to put at risk people who should come in 
their place.” 

Chornobyl is not the only Ukrainian nu- 
clear installation at risk in the war. “There 
have already been several close calls,” says
Rafael Mariano Grossi, director general of 
the International Atomic Energy Agency, 
who traveled to Kyiv this week to discuss 
bolstering nuclear security. On 4 March, 
Russian forces shelled the Zaporizhzhya 
nuclear power plant—fortunately missing its 
reactor halls. Two days later, a rocket attack 
damaged a research reactor used to generate 
neutrons for experiments at the Kharkiv In- 
stitute of Physics and Technology. Nosovskyi 
labels the assaults as nothing short of state-sponsored
“nuclear terrorism.”

But Chornobyl has a unique set of radio- 
active hazards. On 11 March, wildfires ignited 
in the nearby forests, which harbor radio- 
isotopes that were disgorged in the accident 
and taken up by plants and fungi. Russian 
military activities have prevented firefighters 
from entering the exclusion zone, Nosovskyi 
says. The fires continue to burn and could 
grow more intense as the weather warms, he 
says, releasing radiation that could lead to 
“significant deterioration of the radiation sit- 
uation in Ukraine and throughout Europe.” 

So far, remote measurements suggest 
radioactive particle concentrations in the 
smoke do not pose a health hazard, Nosovskyi 
adds, but an automated radiation monitoring 
system that went down in the power outage 
has not yet been brought back online. That 
means “there is no information on the real 
situation in the exclusion zone,” says Viktor
Dolin, research director of the Institute for 
Environmental Geochemistry in Kyiv. 

The restoration of electricity averted the 
nightmare of a hydrogen explosion in the 
spent fuel repository, where 8500 tons of 
uranium fuel rods continue to cool off in 
pools of water. The repository poses a major 
radioactive threat: Through radioactive de- 
cay, the assemblies have accumulated about 
240 times more cesium-137 and 1500 times 
more strontium-90 than the destroyed reac- 
tor spewed in 1986, Dolin says. Staff intend 
to punch holes in the repository’s walls to al- 
low hydrogen gas to escape in the event of a 
future power outage, Nosovskyi says. 

Another menace at Chornobyl is the fuel-containing
masses (FCMs)—a mix of fuel
rods, zirconium cladding, and other materi- 
als that melted in the accident and continue 
to smolder under Unit Four’s sarcophagus,
hastily erected in the wake of the disaster.
For years Ukrainian scientists, with colleagues
from Russia’s Kurchatov Institute,
have kept a tense vigil. (The institute cut ties 
with Ukrainian partners earlier this month, 
issuing a statement supporting the war and 
the “denazification” of Ukraine.) Occasional 
spikes in the number of neutrons streaming 
from certain FCMs—a sign of fission—prompt 
sprinklers to spray gadolinium nitrate solu- 
tion, which absorbs neutrons. 

The odds of self-sustaining fission, or crit- 
icality, in an FCM are minuscule, and even 
if criticality triggered a small explosion, the 
burst would probably be contained within 
an arching steel structure, called the New 
Safe Confinement (NSC), that was erected 
over the sarcophagus in 2016 to shield it 
from the elements and create a safe space 
for cleanup work. But the NSC was not de- 
signed to withstand shelling, and a breach 
could disturb the FCMs. It could also re- 
lease some of the hundreds of tons of highly 
radioactive dust that have accumulated in 
the sarcophagus over the years as the FCMs 
gradually disintegrate. 

Thousands of other sites in Ukraine have 
radiological materials. Most are under the 
watchful eye of Ukraine’s nuclear regulator. 
“There’s a lot of ongoing effort to secure ma- 
terial,” says Peter Martin, a nuclear physicist 
at the University of Bristol who collaborates 
with scientists at Chornobyl. That means, 
where possible, moving sources into vaults 
and repositories. But Vitaly Fedchenko, a nu- 
clear security expert at the Stockholm Inter- 
national Peace Research Institute, notes that 
Ukraine, like other parts of the former Soviet 
Union, has not kept track of all the Soviet 
nuclear legacy. “There are a lot of radioactive
sources that are not on anyone’s radar,”
he says. “Even Ukraine’s radar.”
Japan reboots HPV vaccination
drive after 9-year gap 


Modeling suggests 2013 move that undercut immunization 
will lead to thousands of preventable deaths 


By Dennis Normile 


Nine years ago, Japan’s health ministry
made what many scientists regarded
as a terrible mistake. Pressed
by antivaccine activists who claimed
debilitating side effects, it stopped
recommending that Japanese girls
get a vaccine that helps prevent cervical 
cancer. Now, in what public health officials 
say is a collateral benefit of the success of 
COVID-19 vaccines, the ministry has finally 
reversed its position. On 1 April, it will re- 
sume recommending that girls ages 12 to 
16 get vaccinated against human papillomavirus
(HPV)—“an important signal of confidence
in the vaccine and its safety,” says
Paul Bloem, HPV vaccine strategy lead at the 
World Health Organization (WHO). 

But part of the damage can’t be undone. 
A modeling study published in The Lancet 
in 2020 estimated that the negligible vac- 
cination rate between 2013 and 2019 would 
result in 25,000 preventable cervical cancer 
cases and up to 5700 deaths over time. A 
rapid catch-up campaign for the millions of 
women who missed their shots—which the 
government has now pledged to undertake— 
would only prevent 60% of that toll, the 
study said, because many of the women have 
already been infected with HPV. The disease 
causes nearly 3000 deaths annually in Japan, 


14 1 APRIL 2022 + VOL 376 ISSUE 6588 


in part because cervical cancer screening 
rates are low. 

Japan initially embraced HPV vaccines, 
approving GlaxoSmithKline’s bivalent shot— 
which protects against the two HPV types 
carrying the greatest cancer risk—in 2009, 
and Merck & Co.’s quadrivalent vaccine in
2011. In April 2013, the health ministry added 
both to the national immunization program 
and started to recommend vaccination. 

But just 10 weeks later, an advisory panel 
suggested suspending the recommendation 
after a number of girls reported chronic 
pain, headaches, motor impairment, and 
other symptoms after immunization. The 
ministry complied, and the vaccination rate 
plummeted from about 70% to less than 1% 
of those eligible. 

Such safety problems had not emerged 
in clinical trials, and in 2017, WHO’s Global 
Advisory Committee on Vaccine Safety said 
an extensive review of studies from around 
the world indicated the vaccines were “ex- 
tremely safe.” In Japan, a nationwide survey 
that same year found unvaccinated girls suf- 
fer the symptoms attributed to the vaccines 
at similar rates as vaccine recipients. 

Evidence for effectiveness grew as well. 
The vaccines were approved because they 
prevent HPV infection, but by the late 2000s, 
studies showed they reduced the incidence of 
precancerous lesions as well. Large studies in 


Japan's government again recommends vaccination 
against human papillomavirus for girls ages 12 to 16. 


Sweden and England, reported in 2020 and 
2021, respectively, showed vaccination in the 
early teen years cut the risk of cervical cancer 
by age 30 by 87% to 88%. 

In other countries roiled by reports of 
side effects—including Denmark, the United 
Kingdom, and Colombia—authorities kept 
recommending the vaccines while investigat- 
ing the claims, says Heidi Larson, head of the 
Vaccine Confidence Project at the London 
School of Hygiene & Tropical Medicine. Vac- 
cination rates dipped, but quickly recovered. 
But in Japan, the government was slow to 
review the evidence while antivaccine pressure
“just became louder and louder,” says
Sharon Hanley, a cancer epidemiologist at 
Hokkaido University. Opponents held press 
conferences, seminars, and demonstrations, 
and more than 100 women and girls joined 
lawsuits against the health ministry and vac- 
cine manufacturers. 

Still, calls to reverse the policy increased. 
In 2017, 17 Japanese academic societies 
urged the ministry to resume support for 
vaccination. In 2020 and 2021, a group of 
parliamentarians led by cervical cancer sur- 
vivor Junko Mihara asked the health minis- 
try to reconsider its position. The COVID-19 
pandemic demonstrated the power of vac- 
cines to reduce severe illness and deaths, 
which eventually tipped the scales for the 
HPV vaccines as well, Hanley says. “Anti- 
vaccine rhetoric was also not given much 
space in the media” during the pandemic, 
she says. (About 80% of Japan’s population 
is fully vaccinated against COVID-19.) In Oc- 
tober 2021, the health ministry’s advisory 
committee said there was no reason not to 
restart recommending HPV vaccination. 

Many parents are still wary, and local gov- 
ernments and health care providers will have 
to convince them of the vaccines’ benefits. 
The shots are in short supply globally, par- 
ticularly Merck’s latest, also approved in Ja- 
pan, which protects against nine HPV types. 
And activists are not giving up. Resuming 
proactive recommendation “is without any 
scientific basis and is wrong as public health 
policy,” says Masumi Mizuguchi, a lawyer
representing plaintiffs suing the govern- 
ment. The lawsuits are working their way 
through Japan’s legal system and will con- 
tinue to generate publicity. 

But vaccination supporters believe the 
spell may have been broken. “I am confi- 
dent coverage will resume to previous lev- 
els as quickly as it fell simply due to peer 
power,” Hanley says. To her, the 9-year interlude
holds an important lesson: “When
the government is not supporting [a vac- 
cine], then the people won’t support it.” 
GENOMICS


Most complete human genome yet is revealed 


Unusual cell line helps sequencers read previously indecipherable stretches of DNA 


By Elizabeth Pennisi 


When it comes to sequencing the
human genome, “complete” has
always been a relative term. The
first one, deciphered 20 years ago,
included most of the regions that
code for proteins but left about
200 million bases of DNA—8% of the human 
genome—untouched. Even as additional 
genomes were “finished,” some stretches 
remained out of reach, because repetitive 
segments of DNA confounded the sequenc- 
ing technologies of the time. 
Now, an international grass- 
roots effort has sorted out those 
hard-to-read bases, producing 
the most complete human ge- 
nome yet. 

In six papers starting on 
p. 42, the Telomere-to-Telomere 
(T2T) Consortium—named for 
the chromosomes’ end caps— 
fills in all but five of the hun- 
dreds of remaining problem 
spots, leaving just 10 million 
bases and the Y chromosome 
only roughly known. And on 
31 March, the T2T consortium 
announced in a tweet it had de- 
posited a correct sequence as- 
sembly of the missing Y. 

“I don’t think we could have
imagined this even 5 years ago, 
certainly not 10 years ago,” says 
bioinformaticist Ewan Birney, 
deputy director of the European Molecular 
Biology Laboratory and part of the original
Human Genome Project. “It’s a tour de
force.” T2T researchers say the newly sequenced
stretches reveal hotspots for gene
evolution and underscore the chaotic his- 
tory of the human genome. It “really gives 
us some insight into regions of the genome 
that have been invisible,” says Deanna
Church, a genomicist at Inscripta, a gene- 
editing company. 

The previously indecipherable sequences 
of the genome that have now come into 
clear view include the protective telomeres 
and the dense knobs called centromeres, 
which typically reside in the middle of each 
chromosome and help orchestrate its rep- 
lication. Also almost completely revealed 
are the short arms of the five chromosomes 
where centromeres are skewed toward one 
end. Those short arms were known to contain
scores of genes coding for the backbone
of ribosomes, the cell’s protein factories. 
When Birney, Church, and their col- 
leagues introduced that first draft of a hu- 
man genome in 2001, and even after they 
“completed” and published it in 2004, se- 
quencer machines and genome assembly 
software could not wade through areas 
where the DNA sequence contained very 
repetitive stretches of bases: The repeats 
could too easily be skipped or their bases 
linked together incorrectly. As sequencing 


Surprises awaited in the short arms (green) and centromeres (pink) 
of newly deciphered human chromosomes. 


technology got better and costs dropped, 
scientists reduced the number of gaps or 
misassembled sequences, culminating in 
2017 with the release of a human genome 
called GRCh38. With fewer than 1000 gaps,
it became for many the “reference” against 
which other human genomes are compared. 

But Karen Miga and Adam Phillippy 
wanted to do better. Miga, a geneticist at 
the University of California, Santa Cruz, 
yearned to learn the exact sequences of 
the distinctive “satellite” DNA that helps
form centromeres. Meanwhile, Phillippy, 
a bioinformatician at the National Human 
Genome Research Institute, was busy har- 
nessing new sequencing technologies that 
could read very long stretches of DNA, re- 
ducing the need to piece together shorter 
sequences. After meeting at a conference, 
they joined forces. Then in 2019, Phillippy 


reported they had succeeded in sequencing 
the X chromosome from end to end, inspir- 
ing dozens of other researchers to join the 
cause. “It really took on a life of its own,” 
Miga says. 

To simplify the task, they decided to 
use an anonymized cell line that was de- 
rived more than 20 years ago from an un- 
usual growth excised from the uterus of a 
woman—a failed pregnancy called a mole, 
produced when a sperm entered an egg that 
lacked its own set of chromosomes. With 
just the sperm’s genetic material, such eggs 
can’t develop into an embryo, 
but they can still replicate, es- 
pecially if the sperm delivers 
an X instead of Y chromosome. 
In a boon for the project, both 
members of the resulting cell 
line’s 23 pairs of chromosomes 
are identical. That “made a big 
difference” for eliminating gaps 
because sequencers didn’t have 
to resolve differences between 
the parents’ chromosomes, says 
Robert Waterston, a geneticist 
at the University of Washing- 
ton, Seattle, who helped lead 
the Human Genome Project. 

The T2T group combined 
sequencing technologies, in- 
cluding a so-called nanopore 
device that could read 100,000 
bases at a time and another se- 
quencer that was more accurate 
but only did about 10,000 bases 
at once. A final improvement to the latter 
method boosted accuracy, and together the 
three approaches were able to polish off all 
but five of the final trouble spots. “Just see- 
ing the multiple ways they went after this 
[shows] these are really hard problems,” 
Waterston says. 
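
Why read length matters for repetitive DNA can be illustrated with a toy example. The sketch below is not the consortium's pipeline: the sequence, the repeat unit, and the function name `ambiguous_reads` are all made up for illustration, and real reads contain errors that this ignores. It simply counts how many reads of a given length occur more than once in a short sequence containing a tandem repeat, which is the situation in which an assembler cannot tell where a read belongs.

```python
def ambiguous_reads(genome: str, read_length: int) -> int:
    """Count read start positions whose read also occurs elsewhere in the genome."""
    reads = [genome[i:i + read_length]
             for i in range(len(genome) - read_length + 1)]
    counts: dict[str, int] = {}
    for read in reads:
        counts[read] = counts.get(read, 0) + 1
    # A read is ambiguous if the same string starts at more than one position.
    return sum(1 for read in reads if counts[read] > 1)


# Hypothetical toy "genome": unique flanks around a 40-base tandem repeat.
repeat = "ACGT" * 10
genome = "TTGACCA" + repeat + "GGCATAG"

for length in (8, 20, 45):
    print(f"read length {length}: {ambiguous_reads(genome, length)} ambiguous reads")
```

In this toy case, reads shorter than the repeat can land in several places, while reads long enough to span the whole repeat and reach into the unique flanks can each be placed in only one position.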

The approximately 200 million bases 
finally in the right order and in the right 
place include more than 1900 genes, most of 
them copies of known genes. The research- 
ers cataloged duplicated regions and mobile 
elements—genetic material from viruses that 
became incorporated into the genome. In se- 
quencing each centromere, they learned the 
duplicated regions vary greatly in size, un- 
expected because these knobs serve the same 
purpose in each chromosome. 

The short chromosome arms held an- 
other surprise. As expected, they included 
multiple copies, 400 in all, of the genes
coding for the RNA that’s used to make ri- 
bosomes. “This rDNA was the last domino 
to fall,” as it was the hardest to sequence, 
Miga says. 

The short arms are also “just chock-full 
of [other] repeats,” says Jennifer Gerton, a 
chromosome biologist at the Stowers Insti- 
tute for Medical Research. Those include 
mobile elements, duplicated segments and 
other types of repetitive DNA, as well as 
many copies of genes from other parts of 
the genome. “It’s amazing how dynamic 
the human genome can be,” Church says. 
In five spots along these chromosomes, 
the resulting jumble is so long that the re- 
searchers still can’t clearly determine the 
order of the bases, although they have a 
rough idea of the sequence, Gerton says. 

Short arms are likely hotspots for gene 
evolution, Phillippy notes, as gene copies 
parked there are free to mutate and take on 
new functions. The catalog of duplications 
could also shed light on neurological and 
developmental disorders, which have been 
linked to variations in the number of copies 
of specific sequences. Chemical modifica- 
tions to the DNA in the complex repetitive 
areas likely play a role in disease as well, 
and those changes have been mapped. 

Because the cell line used lacked a Y chro- 
mosome, the T2T group sequenced one from 
a well-studied genome belonging to Harvard 
University systems biologist Leonid Peshkin. 
“I’m excited to be part of this front-line, leading
edge in science,” says Peshkin, whose sequence
will be described in a future paper.

Despite their latest milestone, human 
genome sequencers aren’t packing their 
bags. “There’s still some work to do,” 
says Human Genome Project co-leader 
Richard Gibbs, a geneticist at Baylor Col- 
lege of Medicine. He and other research- 
ers stress that the field now needs to get 
similarly complete genome sequences from 
a greater diversity of people to look for 
variation in the short arms and the other 
tough-to-read regions, which could play a 
role in diseases or traits. 

The T2T team has made a start by de- 
ciphering 70 more genomes, with a goal 
of 350 from people of diverse ancestries. 
These genomes, sequenced as part of the 
Human Pangenome Reference Consor- 
tium, are more challenging to finish be- 
cause they don’t have identical pairs of 
chromosomes. So, for now, the team has 
settled for high-quality genomes that place 
as many of the bases as possible on their 
correct chromosomes. Next, the research- 
ers plan to apply all their methods to 
Peshkin’s whole genome. And, eventually, 
Phillippy says, “We want every genome to 
be telomere to telomere.”
Genital defects seen in sons of
men taking major diabetes drug 


Large study hints that taking metformin before conception 
could raise risk of birth defects by affecting sperm 


By Meredith Wadman 


Metformin, a first-line diabetes drug
used for decades, may boost the 
risk of birth defects in the offspring 
of men who took it during sperm 
development, according to a large 
Danish study. Sons born to those 
men were more than three times as likely 
to have a genital birth defect as unexposed 
babies, according to the paper, published in 
the Annals of Internal Medicine this week. 

The genital defects, such as hypospadias, 
when the urethra does not exit the tip of 
the penis, were relatively rare, occurring in 
0.9% of all babies whose biological fathers 
took metformin in the 3 months before 
conception. But epidemiologists say the 
findings are important because tens of mil- 
lions of people worldwide take metformin, 
chiefly for type 2 diabetes. 

“When I saw the paper ... I thought: 
‘Yup, this is gonna go viral,’” says Germaine
Buck Louis, a reproductive epidemiologist 
at George Mason University who wrote an 
editorial accompanying the report. “[Met- 
formin] is widely used even by young men 
because of the obesity issue that we have. So 


that is potentially a huge source of exposure 
for the next generation.” 

However, Buck Louis and every other sci- 
entist interviewed stressed that the paper’s 
findings are preliminary and observational 
and need to be corroborated. They cau- 
tioned men against abruptly stopping met- 
formin before trying to conceive. 

“Metformin is a safe drug, it’s cheap, and 
it does what it needs to do” by controlling 
blood sugar levels, says the paper’s first 
author, Maarten Wensink, an epidemio- 
logist and biostatistician at the University 
of Southern Denmark. Any change in medi- 
cation “is a complex decision that [a couple] 
should take together with their physicians.” 

Use of metformin, a synthetic compound 
that lowers blood sugar by boosting insulin 
sensitivity, has skyrocketed with the obesity 
epidemic and attendant diagnoses of type 
2 diabetes. In the United States in 2004, 
41 million prescriptions for the drug were 
written; by 2019 that number was 86 million. 

The drug has been in use since the 1950s, 
but this is the first large study to rigorously 
analyze any paternally mediated impact on 
human birth defects. Metformin’s use skews 
toward older people, but the rise in diabetes
means more younger men are taking it.
In the United States, prescriptions to 18- to
49-year-olds with type 2 diabetes grew from
fewer than 2200 in 2000 to 768,000 in 2015.

A nurse measures blood sugar levels in a man with diabetes in Paris.

The researchers analyzed records from 
more than 1.1 million babies born in Den- 
mark between 1997 and 2016, using the 
country’s comprehensive medical regis- 
tries to connect data on births, metformin 
prescriptions, and birth defects. In the 
1451 offspring of men who filled metformin 
prescriptions during the 90 days before 
conception, the period when sperm are be- 
ing made, the team found 5.2% had birth 
defects, compared with 3.3% among un- 
exposed babies. That translated to a 1.4-fold 
increase of at least one major birth defect, 
including genital, digestive, urinary, and 
heart defects, after adjustments for paternal 
age and other factors. 

For genital defects alone, the increased 
risk—only seen in male infants—was much 
larger. Among exposed babies, 0.9% had 
genital defects, compared with 0.24% in 
unexposed babies. The numbers were 
small—13 metformin-exposed boys were 
born with genital defects. But after the re- 
searchers adjusted for factors including 
parental ages and maternal smoking status, 
they found a 3.39-fold rise in the odds of a 
genital defect. “The rate per se was surpris- 
ingly high,” Wensink says. 
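
The crude percentages quoted above can be turned into unadjusted ratios for comparison with the adjusted figures. The sketch below is illustrative arithmetic only: it uses the rounded percentages from the text rather than the study's raw counts, the function names are hypothetical, and it applies none of the study's adjustments for parental age, smoking, or other factors, so its results are not expected to match the reported 1.4-fold and 3.39-fold estimates.

```python
def risk_ratio(p_exposed: float, p_unexposed: float) -> float:
    """Ratio of two risks, each given as a proportion between 0 and 1."""
    return p_exposed / p_unexposed


def odds_ratio(p_exposed: float, p_unexposed: float) -> float:
    """Ratio of the odds p / (1 - p) in the exposed and unexposed groups."""
    odds_exposed = p_exposed / (1.0 - p_exposed)
    odds_unexposed = p_unexposed / (1.0 - p_unexposed)
    return odds_exposed / odds_unexposed


# Rounded percentages quoted in the article (exposed, unexposed).
outcomes = {
    "any major birth defect": (0.052, 0.033),
    "genital defect": (0.009, 0.0024),
}

for label, (exposed, unexposed) in outcomes.items():
    print(f"{label}: crude risk ratio {risk_ratio(exposed, unexposed):.2f}, "
          f"crude odds ratio {odds_ratio(exposed, unexposed):.2f}")
```

For genital defects the crude risk ratio is about 3.75, somewhat above the adjusted 3.39-fold figure, which shows how accounting for other factors can shift an estimate without removing the association.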

Reassuringly, the researchers saw no ef- 
fect in offspring of men who took the drug 
earlier in life or in the year before or after 
the 90-day window of sperm production. “It 
really has to do with taking it in that win- 
dow when the sperm ... is being developed,” 
says senior author Michael Eisenberg, a 
urologist at Stanford Medicine. 

The team also found no additional risk in 
unexposed siblings of metformin-exposed 
babies, or in infants of diabetic fathers who 
took insulin or were not on metformin. All 
those findings suggest it’s the drug’s impact 
on sperm formation, rather than diabetes 
itself or another factor, that’s responsible. 

But the researchers acknowledge that 
men with diabetes who took metformin and 
those who didn’t may have differed in at- 
tributes such as obesity or how well their 
disease was controlled—data that were not 
accessible to the researchers. 

Nor are scientists sure how the drug may 
be impacting sperm. Studies in fish and 
mice suggest metformin can disrupt the de- 
velopment of male reproductive organs. 

The caveats make scientists cautious 
about drawing conclusions from the paper. 
“This paper is the first word, not the last 
word,” says Russell Kirby, a birth defects 
epidemiologist at the University of South 
Florida. “It’s definitely going to require ad- 
ditional research.” 
Software-designed miniproteins
could create new class of drugs 


Small versions of antibodies bind to virtually any target protein 


By Robert F. Service 


Scientists have built their own “mini-antibodies”
using innovative software
that predicts how proteins fold. Designed
to home in on other proteins,
the tailor-made molecules could lead
to a new class of drugs to fight everything
from cancer to COVID-19.

“It’s pretty amazing stuff,” says Steven
Mayo, a chemist at the California Institute 
of Technology who wasn’t involved in the 
study. The strategy could also produce di- 
agnostic probes that could detect diseases 
early, adds Tanja Kortemme, a bioengineer at 
the University of California, San 
Francisco, also not involved. “It 
opens up many possibilities.” 

Because antibodies excel at 
binding to proteins, such as 
those on invading microbes, 
they can serve as drugs to bat- 
tle infections and cancer. But 
antibodies are large proteins, 
costly and often unstable. Ar- 
tificial, miniature versions of 
these binders could be cheaper 
and more stable. But designing 
them to bind specific targets 
has been difficult. 

When antibodies bind to a 
protein, it’s like a rock climber 
trying to scale a sheer cliff face, 
explains David Baker, a computational struc- 
tural biologist at the University of Wash- 
ington (UW), Seattle: The surface is mostly 
smooth with few reliable hand- and foot- 
holds. Antibodies are large, so they can snag 
many weak holds simultaneously, which col- 
lectively allow them to hold firm to a target. 

Miniproteins have fewer options. Research- 
ers have tried to overcome this deficiency by 
identifying hot spots on proteins—strong 
handholds—and then building miniproteins 
around those holds. That only works for a 
handful of targets studied enough that their 
hot spots have been mapped out, however. 

To get around this problem, Baker and his 
colleagues turned to Rosetta, software they 
designed to predict protein structures based 
on their amino acid sequence (Science, 8 Au- 
gust 2008, p. 784). To find potential hand- 
holds, the team fed Rosetta experimental 
maps of target proteins and then instructed 


A miniprotein (dark gold) binds influenza’s hemagglutinin protein.


the program to first calculate how tightly 
specific amino acids would bind to differ- 
ent spots across the surface of a target pro- 
tein. The software then looked for clusters of 
neighboring handholds and determined how 
to build a stable miniprotein that would grab 
onto as many holds in a cluster as possible. 
The researchers used the software to design 
tens of thousands of virtual miniproteins ex- 
pected to bind well to their targets. 
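
The "find handholds, then look for clusters of neighbors" idea can be sketched in a few lines. This is a toy illustration, not Rosetta and not the researchers' method: the grid of binding scores is invented, the threshold is arbitrary, and the function name `handhold_clusters` is hypothetical. It treats a patch of protein surface as a grid, keeps the spots that score above a threshold, and groups adjacent ones with a breadth-first search.

```python
from collections import deque


def handhold_clusters(scores, threshold):
    """Group grid spots scoring at or above threshold into 4-connected clusters."""
    rows, cols = len(scores), len(scores[0])
    seen = set()
    clusters = []
    for r in range(rows):
        for c in range(cols):
            if scores[r][c] < threshold or (r, c) in seen:
                continue
            # Breadth-first search collects one cluster of adjacent handholds.
            queue = deque([(r, c)])
            seen.add((r, c))
            cluster = []
            while queue:
                cr, cc = queue.popleft()
                cluster.append((cr, cc))
                for nr, nc in ((cr + 1, cc), (cr - 1, cc), (cr, cc + 1), (cr, cc - 1)):
                    if (0 <= nr < rows and 0 <= nc < cols
                            and (nr, nc) not in seen
                            and scores[nr][nc] >= threshold):
                        seen.add((nr, nc))
                        queue.append((nr, nc))
            clusters.append(cluster)
    return clusters


# Made-up binding scores for a 4x5 patch of target surface (higher = tighter).
surface = [
    [0.1, 0.8, 0.9, 0.1, 0.0],
    [0.0, 0.7, 0.65, 0.1, 0.6],
    [0.1, 0.1, 0.1, 0.0, 0.7],
    [0.9, 0.0, 0.1, 0.1, 0.8],
]
clusters = handhold_clusters(surface, threshold=0.6)
best = max(clusters, key=len)  # the largest cluster is the candidate binding site
print(len(clusters), sorted(best))
```

A real design pipeline would work on three-dimensional structures and physical energy terms; the grid here only conveys why a cluster of several modest handholds can be a better target than a single strong one.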

Baker and his colleagues tested their 
strategy with miniproteins designed to bind 
12 targets, including proteins involved in 
cancer and the surface proteins of viruses 
such as influenza and SARS-CoV-2. Yeast 
turned these virtual binders 
into actual miniproteins that 
could be evaluated in the lab. 
After tweaking their designs, 
the researchers wound up 
with miniproteins that bound 
as strongly as, if not stronger 
than, most antibodies do, they 
reported last week in Nature. 

Many of the targets play 
roles in disease, suggesting 
the minibinders could have 
potential as medicines. In July 
2021, for example, Baker and 
colleagues reported an early ex- 
ample in a preprint. They pro- 
duced miniproteins that bound 
to and neutralized the spike 
protein of SARS-CoV-2, the virus that causes 
COVID-19. The compounds were then 
shown to protect mice genetically engi- 
neered to be susceptible to SARS-CoV-2 in- 
fection. Clinical trials may begin this year, 
Baker says. 

“We can design proteins to bind to any 
target,” says co-author Longxing Cao of
Westlake University. The resulting mini- 
proteins could not only block disease pro- 
teins, he says, but also direct toxic payloads 
to cancerous tissues or ferry light-emitting 
or radioactive diagnostic molecules to can- 
cer or other disease cells, to indicate the 
progress of treatment. 

Still, Baker says, “There is room for im- 
provement.” Despite Rosetta’s predictions, 
many miniproteins didn’t stick to their tar- 
get. But the software will continue to make 
adjustments, he says, so its designs are sure 
to improve over time. 
Scientists are probing the secrets
of the inner core—and learning how 
it might have saved life on Earth 


By Paul Voosen 


Earth’s magnetic field, nearly as old
as the planet itself, protects life 
from damaging space radiation. But 
565 million years ago, the field was 
sputtering, dropping to 10% of to- 
day’s strength, according to a recent 
discovery. Then, almost miraculously, 
over the course of just a few tens 
of millions of years, it regained its 
strength—just in time for the sudden profusion
of complex multicellular life known as
the Cambrian explosion.

What could have caused the rapid revival?
Increasingly, scientists believe it
was the birth of Earth’s inner core, a 
sphere of solid iron that sits within the 
molten outer core, where churning metal 
generates the planet’s magnetic field. 
Once the inner core was born, possibly 
4 billion years after the planet itself, its 
treelike growth—accreting a few milli- 
meters per year at its surface—would have 
turbocharged motions in the outer core, reviving
the faltering magnetic field and renewing
the protective shield for life. “The
inner core regenerated Earth’s magnetic 
field at a really interesting time in evolu- 
tion,” says John Tarduno, a geophysicist at 
the University of Rochester. “What would 
have happened if it didn’t form?” 

Just why and how the inner core was born 
at that moment is one of many lingering puz- 
zles about the Pluto-size orb 5000 kilometers 
underfoot. “The inner core is a planet 
within a planet,” says Hrvoje Tkalčić, a
seismologist at Australian National Uni- 
versity (ANU)—with its own topography, 
its own spin rate, its own structure. “It’s 
beneath our feet and yet we still don’t understand
some big questions,” Tkalčić says.

But researchers are beginning to chip 
away at those questions. Using the rare 
seismic waves from earthquakes or nuclear 
tests that penetrate or reflect off the inner 


core, seismologists have discovered it spins 
independently from the rest of the planet. 
Armed with complex computer models, 
theorists have predicted the structure and 
weird behavior of iron alloys crushed by 
the weight of the world. And experimental- 
ists are close to confirming some of those 
predictions in the lab by re-creating the 
extreme temperatures and pressures of the 
inner core. 

Arwen Deuss, a geophysicist at Utrecht 
University, feels a sense of anticipation that 
may resemble the mood in the 1960s, when 
researchers were observing seafloor spread- 
ing and on the cusp of discovering plate tec- 
tonics, the theory that makes sense of Earth’s 
surface. “We have all these observations 
now,” she says. It’s simply a matter of putting
them all together. 


THE ANCIENTS thought Earth’s center was 
hollow: the home of Hades or hellfire, or 
a realm of tunnels that heated ocean waters.
Later, following erroneous density
estimates of the Moon and Earth by Isaac 
Newton, Edmond Halley suggested in 1686 
that Earth was a series of nested shells sur- 
rounding a spinning sphere that drove the 
magnetism witnessed at the surface. 

Basic tenets of planet formation pro- 
vided a more realistic picture. Some 
4.5 billion years ago, Earth was likely born 
from the collisions of many asteroidlike 
“planetesimals.” The dense iron in the
planetesimals would have sunk to the core 
of the molten proto-Earth, while lighter 
silicate rocks rose like oil on water to form 
the mantle. At temperatures of thousands 
of degrees and millions of atmospheres of 
pressure, the core would have remained 
molten, even as Earth’s mantle and crust 
cooled and hardened. 

Early 20th century seismologists confirmed
that view with a key bit of evidence: an
earthquake shadow. When an earthquake
strikes, the rupture emits primary, or pres- 
sure, waves (P waves) that ripple out in 
all directions. Secondary, or shear, waves 
(S waves) follow. For large earthquakes, 
seismologists were able to detect P waves on 
the other side of the planet, after they were 
bent and refracted by Earth’s interior layers. 
But strangely, S waves were missing. That 
only made sense if the iron core was liquid, 
because liquids lack the rigidity that allows S 
waves to sashay through. 

It wasn’t until the early 1930s that Inge 
Lehmann, a pioneering Danish seismo- 
logist, noticed another breed of P waves 
that showed the core was not entirely liq- 
uid. These waves arrived at angles that 
were only possible if they had bounced 
off something dense. By 1936 she had de- 
duced the existence of a solid inner core, 
ultimately measured to be about 2440 kilo- 
meters in diameter: the planet inside. 


THE SOUTH SANDWICH ISLANDS are inhospitable
volcanic crags in the far southern
Atlantic Ocean. They are also earthquake 
factories, thanks to the nearby subduction of 
the South American tectonic plate. Seismo- 
logists like them for another, geometric rea- 
son: Earthquake waves that rocket from the 
islands to a lonely seismic station in Alaska 
shoot straight through the inner core. 

Nearly 30 years ago, Xiaodong Song and 
Paul Richards—both seismologists then at 
Columbia University—thought they could 
use those waves to get a handle on the 
spin of the inner core, which, suspended 
in liquid, is under no obligation to rotate 
in sync with the rest of the planet. Comb- 
ing through archival seismic records, they 
looked for subtle variations in the travel 
times of P waves for several dozen South 
Sandwich earthquakes over the course of 
decades. Their travel times through the 
outer core and mantle stayed constant, as 
expected. But with each passing year, P 
waves going through the inner core sped 
up a bit. “It was delicate, but you could see 
the changes,” Song says. 

There was only one way he and Richards 
could account for this puzzling trend: The in- 
ner core was rotating faster than the rest of 
the planet, by about 1° per year. This super- 
rotation was gradually realigning the seismic 
wave paths with a north-south axis in the in- 
ner core known to boost P wave speeds. Every 
400 years, they suggested in a 1996 Nature 
paper, the inner core made an extra revolu- 
tion inside Earth. 
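
The arithmetic behind that estimate is simple enough to check. The sketch below, whose function name is illustrative and does not come from the 1996 paper, assumes a constant differential rotation rate and asks how long the inner core would need to gain a full 360° on the mantle. At exactly 1° per year the answer is 360 years, in line with the rounded figure of roughly 400 years, given that the rate itself was only approximate.

```python
def years_per_extra_revolution(rate_deg_per_year: float) -> float:
    """Years for the inner core to gain one full turn (360 degrees) on the mantle.

    Assumes a constant superrotation rate, which is a simplification.
    """
    if rate_deg_per_year <= 0:
        raise ValueError("rate must be positive for superrotation")
    return 360.0 / rate_deg_per_year


# "About 1 degree per year", as quoted in the text
period = years_per_extra_revolution(1.0)
print(f"{period:.0f} years per extra revolution")  # prints "360 years per extra revolution"
```

A slightly slower rate of 0.9° per year would stretch the period to 400 years, which shows how sensitive the figure is to small changes in the measured rate.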

A few years later, John Vidale, a seismo- 
logist now at the University of Southern Cali- 
fornia, validated the result using a slightly 
different method. Vidale specializes in us- 
ing records from the Large Aperture Seis- 
mic Array (LASA), a U.S. Air Force facility in 
Montana, closed in 1978, that operated more 
than 500 sensors in deep boreholes to detect 
atomic bomb tests. “It’s still the best data, 
better than the best arrays today,” he says.
Seismic waves from nuclear tests were ideal 
because, unlike earthquakes, the source can 
be precisely located. 

Vidale used the waves from two Soviet 
underground bomb tests detonated in 1971 
and 1974 beneath Novaya Zemlya, a remote 
Arctic archipelago. Instead of looking for 
waves that passed through the inner core, 
as Song and Richards did, Vidale chose ones 
that ricocheted off it, registering its spin like 
the beam of a radar gun. “We could see one 
side of the inner core getting closer, and one 
side getting further away,” he says.

Iron heart

Earth’s solid inner core, buried
5000 kilometers below our
feet, has remained enigmatic
since its discovery nearly
100 years ago. About the size
of Pluto and growing several
millimeters every year, it helps
power Earth’s magnetic field.
It also possesses a strange
interior structure that is only
now coming into view with
advances in seismology.

A mysterious reflection

Earthquake pressure (P) and shear (S) waves refract
as they pass through Earth, but the liquid outer core
stymies S waves. In 1936, Inge Lehmann discovered
P waves in a shadow zone associated with an
entirely molten core, which was only possible if the waves were
reflections off a solid sphere rather than refractions.

[Diagram labels: earthquake, P wave, S wave, Lehmann P wave, mantle, outer core, P wave shadow zone, S wave shadow zone]

He found that over the 3 years between
the tests, the inner core rotated 0.15° per year
faster than the rest of the planet, much less
than Song’s first estimate. But subsequent
work by Song in 2005, using 18 pairs of South
Sandwich earthquakes that repeated in the
same spot over the span of decades, lined up
with Vidale’s reduced estimate.

The discovery of the inner core’s super- 
rotation shocked many geophysicists, who 
had assumed it spun at the same rate as the 
mantle. It also tantalized them. The rota- 
tion could offer clues to how the inner core 
couples to the outer core and influences the 
magnetic dynamo. Some thought it could 
even help explain why Earth’s magnetic poles 
wander and flip from time to time. 

But almost as rapidly as this picture of 
the inner core’s spin emerged, it grew more 
complicated and more mysterious. “What we 
thought 10 years ago isn’t holding together,”
Vidale says. 


RECENTLY, SONG, now at Peking University,
decided to revisit his rotation work. His postdoc,
Yi Yang, compiled the world’s most extensive
database of repeating earthquakes,
with sources not just in the South Sandwich
Islands, but also in places like Chile and Kazakhstan.
Analyzing more than 500 source-detector
pairs with a range of paths through
the core, Song and Yi found that the superrotation
stopped all at once a decade ago, and
since then the inner core has rotated at the
same speed as the mantle. The changes “all
disappear at the same time,” says Song, who
presented the work at a meeting of the American
Geophysical Union (AGU) late last year.

Magnetic driver

Earth’s magnetic field, which protects life from
radiation, is driven by convective motions in
the molten outer core. Growth of the inner core
turbocharges those motions. As iron crystallizes, it
spits out light elements like oxygen or silicon, which
rise toward the mantle, dragging iron with them.

[Diagram labels: magnetic, rotation, convection]

Crust
Life sits on a layer of rock that is vanishingly thin
compared with the rest of the planet.

Mantle
Earth’s thickest layer is made of 3000 kilometers
of sticky silicate rock.

Outer core
The molten iron outer core was born along with
Earth 4.5 billion years ago.

Inner core
At 6000°C and 3 million atmospheres of pressure,
the inner core is solid iron but soft.

Innermost inner core
At the core’s center is an off-center globe with odd
seismic characteristics.

Erratic spinner

Waves from repeating earthquakes and nuclear
tests have shown that the inner core does not
rotate in sync with the rest of the planet. Some
researchers believe gravitational tugs from
dense blobs at the bottom of the mantle could
be responsible for the erratic spinning.

[Diagram labels: nuclear test, repeater earthquake, inner core spin, detector]

Meanwhile, Vidale was trying to push
his trend further back in time using LASA
data. He focused on two bomb-induced
earthquakes, both set off by the U.S. government
underneath the far end of Alaska’s
Aleutian Islands, in 1969 and 1971. The tests
were controversial; the second, Cannikin, at
5 megatons, was the largest ever U.S. underground
test, and it faced opposition from
environmental activists who chartered a
fishing ship, christened it Greenpeace, and
sailed it to the island in protest. Despite appeals
to the Supreme Court, the test went
as planned, creating a crater lake at the island’s
surface even though the detonation
was 1800 meters down.

The two tests created another, much 
delayed splash last year at the AGU meet- 
ing. Vidale reported that waves from the 
detonations revealed not superrotation, but 
subrotation: During the time between the 
two U.S. tests, the inner core rotated more 
slowly than the rest of the planet, by some 
0.05° per year. Yet by the time of the Soviet 
tests, the inner core had somehow reversed 
course and sped up. The “observations are 
really amazing,” says Barbara Romanowicz,
a seismologist at the University of California
(UC), Berkeley.

For Vidale, the pattern from 1969 to 1974, 
from slow to fast, may indicate a fundamen- 
tal rhythm of the inner core. For decades, 
radio astronomers have tracked minute 
changes in Earth’s surface rotation—the 
length of a day—against a cosmic reference 
frame: the fixed position of distant cosmic 
beacons called quasars. Although most of the 
yearly jitter is due to events like hurricanes 
and earthquakes, a tiny-but-regular 6-year 
wobble in day length has emerged. “Nobody 
has been able to say what causes it,” says
Benjamin Chao, a geodesist at Academia 
Sinica. “But everybody bets on the core.” 

Chao says one possible explanation for 
the 6-year cycle is gravitational interac- 
tions between the mantle and inner core. 
The inner core is likely to be lumpy, with 
hills hundreds of meters high, and at the 
bottom of the mantle, seismologists have 
discovered two ultradense, continent- 
size blobs. The tugs of the blobs on the 
hills could create a loose coupling be- 
tween the mantle and the inner core— 
enough to “pull the inner core back and 
forth” in cycles of superrotation and sub- 
rotation, Chao says. 

Song, however, only sees a slowdown, 
with no sign of an oscillation. He ties 
his record to a longer term trend in the 
length of a day, which saw the planet spin 
progressively faster from the 1970s before 
settling down in the early 2000s. Song 
thinks gravitational tugs from the mantle 
might have pulled the inner core along, 
but with a lag. 

Given that neither finding has yet 
been published, it’s hard to say how they 
fit together. “Is everybody right? Is every- 
body wrong?” Romanowicz asks. Either 
way, varying rotation seems more plau- 
sible than constant superrotation, says 
Miaki Ishii, a seismologist at Harvard Uni- 
versity. “It makes more sense than what we 
have right now.” 


THE INNER CORE is the most metal place on 
Earth—even more so than the outer core. 
Both are made mostly of iron, along with 
a smattering of nickel. But the iron is 
thought to also contain traces of lighter el- 
ements like oxygen, carbon, and silicon. As 
the iron crystallizes on the growing surface 
of the inner core, it spits out some of those 
elements, leaving behind almost pure iron, 
much as ice freezing from a bucket of salt- 
water expels the salt and becomes largely 
fresh. The expelled elements, lighter than 
iron, rise and sweep along the surrounding 
liquid, driving up to 80% of the convection 
that generates Earth’s magnetic field. 

The nature of the iron left behind is the
subject of ongoing debate. Iron atoms at
Earth’s surface—in your cast iron skillet, for 
example—pack themselves in cubic arrange- 
ments. But when tiny samples of iron are 
compressed between two diamonds to inner 
core-like pressures, the atoms rearrange into 
hexagons. The hard question is what hap- 
pens when iron is simultaneously squashed 
and heated to thousands of degrees, says 
Lidunka Vočadlo, a computational mineral
physicist at University College London. These
conditions are difficult to re-create in the
lab, because carbon in the diamonds often
contaminates the iron when the apparatus
is heated. But in computer models, Vočadlo
says, “There’s no limit to the pressure and
temperature you can get.”

In 1971, a 5-megaton nuclear bomb was lowered into
a borehole in Alaska. Seismic waves from the
blast bounced off the inner core, helping gauge its spin.

Modeling by Vočadlo and her collaborators
suggests hexagonal packing is the most
stable arrangement under inner core condi- 
tions. The models also find that pure iron 
grows soft when it sits at 98% of its melt- 
ing point, as it may throughout much of 
the inner core. This “premelting effect,” as
it is called, could explain why S waves travel 
much slower than expected in the supposedly 
solid inner core. 

The story isn’t closed for cubic iron, how- 
ever. Just as water must cool below freezing 
before ice can nucleate, researchers have 
suggested iron can’t solidify directly into its 
hexagonal form unless it is nearly 1000 K 
cooler than the inner core. Atom-scale modeling
published early this year by a team led
by Yang Sun, a mineral physicist at Colum- 
bia, suggests a solution: Iron accreting onto 
the inner core could first crystallize into its 
cubic form before transitioning into a hex- 
agonal end state. 

Although the cubic versus hexagonal 
debate may seem academic, the structure 
may determine how the iron crystals align, 
how much nickel and other light elements 
can mix with the iron, how much heat it re- 
leases on crystallization, and even its melt- 
ing point. “The fundamental properties of 
iron change depending on what phase 
you’re in,” Vočadlo says.

A new wave of lab studies may help 
settle the question. After years of halting 
progress, researchers are on the verge of 
regularly re-creating and observing in- 
ner core conditions. One strategy is to 
press and heat iron in diamond anvil 
cells, as before—but to glimpse its struc- 
ture, quickly, before it is contaminated 
with carbon. New, powerful x-ray light 
sources such as the Extremely Brilliant 
Source at the European Synchrotron Ra- 
diation Facility, which turned on in 2020, 
can take that kind of flash photo. 

Another is to harness the massive 
lasers of the National Ignition Facility 
(NIF) at Lawrence Livermore National 
Laboratory (LLNL), which are typically 
aimed at pellets of hydrogen isotopes to 
spark tiny nuclear fusion reactions. In a 
study published earlier this year, NIF re- 
searchers instead turned some of those 
beams on iron, heating and pressuriz- 
ing it to levels far beyond those seen in 
Earth’s core. Each time they examined 
the iron’s structure with an x-ray, it came 
out the same—as hexagonal iron, says 
Richard Kraus, an LLNL research scien- 
tist who led the study. 

A third tack to re-create the inner core 
is through shock wave experiments. Jung-Fu 
Lin, an experimental mineral physicist at the 
University of Texas, Austin, has partnered 
with researchers in China who use bursts 
of gas to fire projectiles into iron at speeds 
10 times faster than a rifle bullet, generating 
corelike temperatures and pressures. They 
are already seeing hints of the premelting effect
identified by Vočadlo and predicted by
others. If the results hold up, they may sug- 
gest the “solid” inner core isn’t so solid after 
all. “It’s like a smoothie,” Lin says. “Very soft.” 

If the inner core is a mystery, then the 
“innermost” inner core is a riddle wrapped 
in a mystery. Since the 1980s, seismologists 
have known that seismic waves run faster 
through the inner core along a north-south 
axis, perhaps because the iron crystals 
have a common alignment, presumably 
along the prevailing direction of Earth’s
magnetic field. But in 2002, Ishii and Adam
Dziewonski, also at Harvard, discovered
that within a sphere roughly 600 kilometers
across, that fast lane is tilted by 45°. Ishii
says that anomaly could be a relic of an
ancient, tilted magnetic field or a kernel of
cubic rather than hexagonal iron. No matter
what, she says, “There’s something different
going on at the center of the Earth.”

Experiments at the National Ignition Facility have focused high-power lasers on small samples of iron to re-create inner core conditions.

Researchers are poised to turn these hints 
into something more rigorous. Over the 
past decade, a clutch of high-quality seismo- 
meters has been erected in Antarctica, allow- 
ing researchers to catch far more earthquake 
waves that pass through the inner core’s 
north-south fast lanes. Armed with the im- 
proved resolutions provided by these waves 
and many others globally, Utrecht’s Deuss 
and her graduate student Henry Brett used 
a supercomputing-based technique to create 
the first 3D view of the inner core—a bit like 
a CT scan in the hospital. 

This work, set for publication soon, con- 
firms the existence of the innermost core, 
but finds it is slightly offset from the plan- 
et’s center. It also reveals speed differences 
between the fast lanes seen in the inner 
core’s western and eastern hemispheres. 
That suggests the story of the fast lanes 
is more complicated than iron crystals 
aligning with the dominant magnetic field, 
which would have a more uniform signal. 
It’s still early days, similar to where imag- 
ing of the mantle was in the 1980s, but 
Brett says more detailed models are com- 
ing soon. “We're going to be able to ask 
more interesting questions.” 


22 1 APRIL 2022 • VOL 376 ISSUE 6588

ALL THIS COMPLEXITY appears to be geologically
recent. Scientists once placed the
inner core’s birth back near the planet’s
formation. But a decade ago, researchers 
found, using diamond anvils at outer core 
conditions, that iron conducts heat at least 
twice as fast as previously thought. Cooling 
drives the growth of the inner core, so the 
rapid heat loss combined with the inner 
core’s current size meant it was unlikely 
to have formed more than 1 billion years 
ago, and more than likely came even later. 
“There’s no way around a relatively recent 
appearance of the inner core,” says Bruce 
Buffett, a geodynamicist at UC Berkeley. 


John Tarduno, a geophysicist at the University of
Rochester, realized rocks from the time
might record the dramatic magnetic field 
changes expected at the inner core’s birth. 
Until recently, the paleomagnetic data 
from 600 million to 1 billion years ago 
were sparse. So Tarduno went searching 
for rocks of the right age containing tiny, 
needle-shaped crystals of the mineral tit- 
anomagnetite, which record the magnetic 
field’s strength at the time of their crys- 
tallization. In a 565-million-year-old vol- 
canic formation on the north bank of the 
St. Lawrence River in Quebec, his team 
found the crystals—and convincing evi- 
dence that the magnetic field of the time 
was one-tenth the present day strength, 
they reported in 2019. The fragility of the 
field at the time has since been confirmed 
by multiple studies. 


It was probably a sign that rapid heat 
loss from the outer core was weakening 
the convective motions that generate the 
magnetic field, says Peter Driscoll, a geo- 
dynamicist at the Carnegie Institution for 
Science. “The dynamo could have been 
close to dying,” he says. Its death could 
have left Earth’s developing life—which 
mostly lived in the ocean as microbes and 
protojellyfish—exposed to far more ra- 
diation from solar flares. In Earth’s atmo- 
sphere, where oxygen levels were rising, 
the increased radiation could have ionized 
some of this oxygen, allowing it to escape 
to space and depleting a valuable resource 
for life, Tarduno says. “The potential for loss 
was gaining.” 

Just 30 million years later, the tide had 
turned in favor of life. Tarduno’s team went 
to quarries and roadcuts in the Wichita 
Mountains of Oklahoma and harvested
532-million-year-old volcanic rocks. After 
analyzing the field strength frozen in the tiny 
magnetic needles, they found that its inten- 
sity had already jumped to 70% of present 
values, they reported at the AGU meeting. 
“That kind of nails it now,” Tarduno says. He
credits the growth of the inner core for the 
field jump, which he says is “the true signa- 
ture of inner core nucleation.” 
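
A back-of-the-envelope comparison of the two rock records shows the size of that jump. The sketch below takes the ages and relative field strengths quoted above (one-tenth of today's value 565 million years ago, 70% of it 532 million years ago). The function and variable names are illustrative assumptions, and the calculation ignores the measurement uncertainties attached to both samples.

```python
def field_change(older_age_myr, older_fraction, younger_age_myr, younger_fraction):
    """Return (elapsed time in Myr, growth factor) between two paleointensity samples.

    Ages are in millions of years before present; fractions are field
    strength relative to the present-day value.
    """
    elapsed = older_age_myr - younger_age_myr
    if elapsed <= 0:
        raise ValueError("the older sample must have the larger age")
    return elapsed, younger_fraction / older_fraction


# Quebec rocks (565 Myr, ~10% of present) vs. Oklahoma rocks (532 Myr, ~70%)
elapsed, factor = field_change(565, 0.10, 532, 0.70)
print(f"{elapsed} Myr elapsed, field about {factor:.0f}x stronger")
```

On these numbers the field strengthened roughly sevenfold in a little over 30 million years, a short interval by geological standards.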

Around the same time, life experienced 
its own revolution: the Cambrian explo- 
sion, the rapid diversification of life that 
gave rise to most animal groups and 
eventually led to the first land animals, 
protomillipedes that ventured onto land 
some 425 million years ago. 

It just may be that the clement world they 
found owes much to the inner iron planet 
we'll never see, 5000 kilometers below.


Rules all PIs should follow


We asked young scientists to write a rule that all principal investigators
(PIs) should be required to follow to improve the experience of young
scientists in their lab. Read a selection of their suggestions below. Follow
NextGen Voices on Twitter with hashtag #NextGenSci.
—Jennifer Sills


Encourage debate

Admit when you are wrong and encourage
trainees to do the same. Science provides
many opportunities to be wrong, whether
it be disproving a favorite hypothesis or
forgetting a step in a protocol. But young
scientists often have a fear of failure or
worry about facing repercussions for
their mistakes. When PIs openly admit
their missteps, they model how disproven
hypotheses and mistakes are actually
opportunities for learning and growth.
Transparency about setbacks also empowers
trainees to speak up with opposing
opinions, making for richer discourse.


Jennifer S. Chen
Yale School of Medicine, New Haven, CT 06510,
USA. Email: jennifer.s.chen@yale.edu


Take students’ questions seriously and 
answer them constructively. PIs who show 
students this respect will stimulate
their research motivation by demonstrating
that the lab is a safe environment to
express opinions.

Chih Ying Huang
Department of Physics, National Cheng Kung
University, Tainan 70101, Taiwan (Republic of
China). Email: magicO910@gmail.com


Give students the freedom to express 
their views. Encouraging all lab members 
to share their ideas will help establish a 
friendly environment in the lab, allow- 
ing young scientists to enjoy the research 
process and grow intellectually. 


Shantanu Lanke 
Maharashtra 422101, India. 
Email: slanke@asu.edu 


Be humble. When PIs interpret data, train- 
ees may not feel comfortable sharing an 
alternative opinion. By reminding trainees 
that a PI’s intuition may be wrong, the
PI can encourage young scientists to
cultivate their curiosity, imagine new
possibilities, develop original ideas,
and judge results objectively.

Michael Sanjay Fernandopulle
Medical Scientist Training Program,
Northwestern University Feinberg School of
Medicine, Chicago, IL 60611, USA.
Email: michael.fernandopulle@northwestern.edu


Facilitate learning 


Invite at least one expert from another 
field to give a lecture to the research 
group every month. Learning about other 
disciplines will broaden young research- 
ers’ horizons, stimulate their thinking, 
and introduce them to different research 
methods. 

Yongsheng Ji
Division of Life Sciences and Medicine, University
of Science and Technology of China, Hefei, Anhui
230026, China. Email: jiys2020@ustc.edu.cn


After an experiment, model the correct 
way to clean the lab and document safety 
records. By doing this work together,
PIs can increase young scientists’ safety
awareness, cultivate careful and responsible
methods, and foster a team spirit
among the group.

Yuan Zhi
School of Economics, Guizhou University,
Guiyang, Guizhou 550025, China.
Email: yzhi@gzu.edu.cn


Hold monthly lab meetings during which 
PhD students present their work. These 
presentations will allow young researchers 
to gain experience in data representation 
and science communication and give the 
PI an opportunity to monitor progress 
and offer relevant feedback. 

Sara Granado Rodriguez
Department of Biology, Universidad Autónoma
de Madrid, Madrid, Madrid, 28049, Spain.
Email: sara.granadol2@gmail.com


Promote cooperation instead of competi- 
tion between students within your own 
lab and between labs. This strategy will 
create a supportive learning environment 
for students, enhance their productivity, 
and expand their skill set through shared 
knowledge and equipment. 


Andrea Y. Frommel
Faculty of Land and Food Systems, University
of British Columbia, Vancouver, BC V6T 1Z4,
Canada. Email: andrea.frommel@ubc.ca


Foster independence 

When making a strategic project deci- 
sion, demonstrate how the decision is in 
the best interest of the young researchers 
involved. Shifting the burden of proof to
the PI would automatically give leverage
to the young researcher in negotiating
the project’s direction and give them the
power to veto decisions that only benefit
the PI. If respect for the young researcher’s
career prospects were a standard
criterion in decision-making, exploitative
practices would become untenable.

Martin Lukačišin
Faculty of Medicine, Technion, Haifa 3525433,
Israel. Twitter: @CombinationBio


If you move from one institution to 
another, help lab members find new posi- 
tions. When PIs accept job offers, young 
scientists in the lab they are leaving face 
substantial challenges. PIs should provide 
support to all lab members as they find 
new positions, even if they don’t plan to 
follow the PI to the new institution. 

Yida Zhang
Department of Biomedical Informatics, Harvard
University, Brookline, MA 02445, USA. Email:
yida_zhang@hms.harvard.edu


Encourage young scientists to do what’s 
best for their budding career, not
what’s best for your established career.
Supervisors should support and mentor
their students and postdocs to help them
build the skills required to get jobs and
navigate their own career path.

Christina N. Zdenek
School of Biological Sciences, The University of
Queensland, St Lucia, QLD 4072, Australia.
Twitter: @CNZdenek


Bring your students to academic confer- 
ences. Building a professional network is 
crucial for young scientists, and PIs can 
facilitate connections to senior scientists.

Xiao-Yu Wu
Department of Mechanical and Mechatronics
Engineering, University of Waterloo, Waterloo, ON
N2L 3G1, Canada. Twitter: @XiaoYuWul0


Provide students with opportunities to 
think creatively. Giving lab members the 
freedom to explore their unique strengths 
and encouraging them to think about 
problems from different perspectives will 
empower them and improve the quality of 
their research. 

Senthilkumar Seenuvasaragavan
Mumbai 400076, India.
Email: shrisenjohn@gmail.com


Provide positive reinforcement

Encourage your students unconditionally.
Young scientists may give up if they
receive no positive reinforcement for their 
hard work. PIs should let them try new 
things. If they succeed, PIs can give them 
affirmation. If they fail, PIs can remind
them that the experience is a lesson that
will help them improve.

Yan Zhuang
Peking-Tsinghua Center for Life Sciences, Peking
University, Beijing 100871, China.
Email: zhuangyanisme@pku.edu.cn


Offer honest words of encouragement. 
Research is difficult, failure frequent, and 
feelings of inadequacy rampant among 
young scientists. Words of encouragement 
from a PI can be the difference between a 
student giving up or pushing through yet 
another experiment. 

Cathrine Bergh
KTH Royal Institute of Technology, 11428
Stockholm, Sweden. Email: cabergh@kth.se


Recognize young scientists’ work both 
publicly and privately. By publicly giving 
lab members credit as well as privately 
communicating appreciation for their 
contributions, PIs can ensure that every- 
one in the lab feels respected. 

Jaime Coulbois
Departamento de Ciencia Política y Relaciones
Internacionales, Universidad Autónoma de
Madrid, Ciudad Universitaria de Cantoblanco,
Madrid, Spain. Twitter: @CoulboisJaime


Create a supportive environment

Support the individual goals and aspirations
of each lab member. When PIs
acknowledge each person’s unique per- 
spective, they can encourage cooperating 
and celebrating each other’s successes, 
creating an environment where everyone 
can flourish. 

Salam Salloum-Asfar
Qatar Biomedical Research Institute, Hamad Bin
Khalifa University, Qatar Foundation, Doha, Qatar.
Twitter: @Dr_SalamSalloum


Organize a homemade group dinner at 
least once a month. My mentor showed 
us that cooking dinner is just like doing 
experiments. While cooking, our group 
learned the spirit of collaboration. While 
eating, we talked freely about academic 
ideas or the progress we had made. These 
valuable experiences allowed us to relax 
and feel closer to one another. 

Bo Cao
Core Research Laboratory, The Second Affiliated
Hospital, School of Medicine, Xi’an Jiaotong
University, Xi’an, Shaanxi 710004, China.
Email: bocao@vip.qq.com


Treat others as you would have liked to 
be treated—not how you were treated! 
Sometimes PIs believe that if they 
survived an unpleasant training environ- 
ment, then their students should, too. 
However, the best PIs try to improve the 
training experience by working together
with young scientists to overcome challenges.
If everyone were required to
support members of their lab, the culture
of academia would improve.

Katherine Davis
Imperial College London, London, UK.
Twitter: @kd_katdavis


Be responsive to the unique cultural 
needs, such as ethnicity, nationality, 
gender identity, sexual orientation, health, 
and socioeconomic status, of your men- 
tees. Culturally responsive mentorship 
provides mentees with more opportunities 
to succeed and a greater sense of comfort 
and safety. 

Fernanda Oda
Department of Applied Behavioral Science,
University of Kansas, Lawrence, KS 66045, USA.
Email: oda@ku.edu


Solicit feedback and act to address it. 
Once a year, the PI should request anony- 
mous feedback from all lab members 
about the lab’s culture, atmosphere, sci- 
ence, and integrity. The next year, the PI 
should report to both the lab group and 
an independent committee of peers what 
steps were taken to address concerns.

Nikos Konstantinides
Université Paris Cité, Centre National de la
Recherche Scientifique, Institut Jacques Monod,
F-75013 Paris, France. Twitter: @nkonst4


Respect young scientists’ time

Avoid contacting lab members by instant
messaging applications outside of business
hours. After work hours, PIs should
respect their students’ time by using 
email, which alleviates the pressure to 
respond right away. 

Liping Zhang 

Department of Mechanical and Energy 
Engineering, Southern University of Science and 
Technology, Shenzhen, Guangdong 518055, China. 
Email: zhanglp@sustech.edu.cn 


Encourage students to finish their 
academic degree on schedule without 
extensions. PIs have an incentive to keep 
their PhD candidates for extra time in 
hopes of getting more data or finishing a 
project while paying a lower stipend than 
a postdoc would demand. Such delays 
adversely affect students’ careers and may 
make them feel frustrated. Moreover, des- 
peration to generate more experimental 
data could potentially lead to inappropri- 
ate experiments or fraud. 


Name withheld 
Calgary, Canada. 


Refrain from contacting lab members on 
weekends. Young scientists need a chance 
to reflect, rejuvenate, and contextualize 
their work. Good science is driven by 
inspiration, and time away from the work 
environment is important for both the 
young scientist and the PI to plan their 
work, identify gaps and opportunities, and 
continually assess long-term goals. 

Divyansh Agarwal 

Harvard Medical School, Massachusetts General 
Hospital, Boston, MA 02114, USA. 
Email: dagarwal@mgh.harvard.edu 


Encourage attainable thesis milestones. 
PIs should help young scientists deter- 
mine when abandoning a project is in 
their best interest. 

Joseph Nicholas Rainaldi 

Department of Biomedical Science, University of 
California, San Diego, San Diego, CA 92122, USA. 
Email: jrainaldi@ucsd.edu 


Communicate regularly 

Talk to your students regularly and 
listen to their concerns. Frequent 
communication between PIs and young 
scientists facilitates immediate feedback, 
prevents studies from getting stuck, and 
helps students cope with failures. The PI 
must listen carefully, acknowledge the 
students’ opinions, and give them the 
space to have their own ideas. Ongoing 
dialogue will help students develop criti- 
cal thinking skills and allow them to start 
thinking independently. 

Jan Kadlec 

Department of Brain Sciences, Weizmann 
Institute of Science, Rehovot 7610001, Israel. 
Email: jan.kadlec@weizmann.ac.il 


Recognize that tasks you find trivial 
or routine, such as publishing a paper 
or finding out whether funding will be 
approved, may seem crucial to the people 
you supervise. Instead of letting emails 
about such issues languish, PIs should 
acknowledge the importance of the 
subject and provide a status update to 
prevent unnecessary stress. 

Jelle Vekeman 

Center for Molecular Modeling, Ghent University, 
9052 Zwijnaarde, Oost-Vlaanderen, Belgium. 
Email: jelle.vekeman@ugent.be 


Treat each of the young scientists in your 
lab to lunch once a month. Food is a great 
equalizer; sharing a meal can make a PI 
more approachable. The meeting can also 
serve as a check-in for the PI to see how 
each young scientist is doing, in an infor- 
mal environment where information and 
conversation flow more freely. 


Vishal Anirudh Kanigicherla 

Department of Chemistry, College of Arts and 
Sciences, University of Pennsylvania, Philadelphia, 
PA 19104, USA. Email: vkanigicherla@gmail.com 


Schedule regular one-to-one meetings with 
your PhD students. Such a routine for- 
mality will provide students with advance 
notice and allow them to practice 
presenting their work in an allotted time 
frame. Spontaneity does not work for 
everyone; students deserve the time and 
space to analyze, interpret, and articulate 
their own data before a meeting. 

Kathryn Oi 

Department of Chemistry, University of Zurich, 
ZH 8052 Zurich, Switzerland. 
Email: kathryn.oi@chem.uzh.ch 


Set realistic expectations 
Critically read a copy of your own PhD 
dissertation before evaluating the work 
of young scientists. This exercise will 
remind PIs that they also had a lot to learn 
early in their career, allowing them to set 
reasonable expectations for their students. 
Fostering a more understanding environ- 
ment could greatly improve mental health 
for everyone in the lab. 

Kyle J. Isaacson 


Ike Scientific, South Weber, UT 84405, USA. 
Twitter: @kjisaacson 


Actively engage in bench work. PIs who 
perform day-to-day experiments alongside 
their mentees will motivate young scien- 
tists and ensure that the PI is aware of the 
challenges they face. The daily presence 
of PIs will also break down the barriers 
between the mentor and mentee, which 
will lead to unhindered exchange of ideas 
and a more enjoyable and productive work 
environment free of unrealistic expecta- 
tions from either party. 

Rakesh Ganji 

Department of Developmental, Molecular, and 
Chemical Biology, Tufts University, Boston, MA 
02111, USA. Email: rakesh.ganji@tufts.edu 


Clarify boundaries early. For young 
scientists, the “getting to know you” 
phase of working with a PI can be stress- 
ful. Specifying details such as when it is 
okay to email, how much time is needed 
when requesting recommendation letters, 
and whether unprompted doorway chats 
are welcome can be critical to making 
young scientists comfortable in a lab. 
Moreover, this sets a good example for 
young scientists to confidently set their 
own boundaries, which, in the long term, 
can help to fight burnout and imposter 
syndrome. 

Emma Dawson-Glass 

Research Department, Holden Arboretum, 
Kirtland, OH 44094, USA. 
Email: edawson-glass@holdenfg.org 


10.1126/science.abp9887 
The road to a larger brain 


Ecological opportunities in the early Cenozoic favored larger, not smarter, mammals 


By Felisa A. Smith 

Department of Biology, University of 
New Mexico, Albuquerque, NM 87131-0001, USA. 
Email: fasmith@unm.edu 

Modern mammals have the largest 
brains relative to their body size 
among vertebrates (1, 2). Because 
brain size is correlated with cognitive 
ability, complex behavior, 
and puzzle-solving abilities, this 
suggests that mammals are better able 
than other animals to interface with their 
environment (3, 4). But how and why did 
mammals evolve large brain sizes relative 
to their body mass? The question of 
whether the rapid morphological evolution 
of mammals after the Cretaceous-Paleogene 
extinction was accompanied by 
equally rapid changes in brain size speaks 
to the role of ecology in macroevolution. 
On page 80 of this issue, Bertrand et al. (5) 
address the idea of “brawn before brain” 
by characterizing the timing and pattern 
of mammal brain development across 
the Early Jurassic to the middle Cenozoic 
(~200 million to 30 million years ago). 
The phylogenetic encephalization quo- 
tient (PEQ) is a measure of relative brain 
size: the ratio of actual to expected brain 
volume for a given body size and phylog- 
eny. Encephalization refers to an increase 
in this ratio. Although the mammal lin- 
eage experienced several sudden increases 
in encephalization during the Mesozoic— 
perhaps driven by selection for enhanced 
hearing, smell, taste, and feeding abili- 
ties—the range of body mass and PEQ was 
relatively limited compared with that of 
modern mammals (6-10). This changed 
with the extinction of nonavian dinosaurs 
at the Cretaceous-Paleogene boundary 
(66 million years ago) as surviving mam- 
mals rapidly diversified to occupy newly 
vacated ecological niches (10, 11). But was 
this rapid increase in body mass accompa- 
nied by an equally rapid and proportional 
increase in brain size (8)? Or did encephalization 
occur at a different pace? That is, 
did mammals first become “smarter” then 
larger, or larger then smarter? 

Fossils of Morganucodon, an extinct genus from the 
Mesozoic, revealed a major step in brain evolution in 
mammaliaform animals. 
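
The encephalization quotient described above, a ratio of actual to expected brain volume, can be sketched in a few lines. This is a simplified illustration, not the method of Bertrand et al.: a true PEQ fits the expected-volume curve with a phylogenetically informed regression, whereas the `intercept` and `slope` values here are hypothetical placeholders, as are the two example animals.

```python
def expected_brain_volume(body_mass_g, intercept=0.12, slope=0.67):
    """Allometric expectation: intercept * mass**slope.

    The default coefficients are illustrative placeholders, not values
    fitted to any data set in the article.
    """
    return intercept * body_mass_g ** slope


def encephalization_quotient(brain_volume, body_mass_g, intercept=0.12, slope=0.67):
    """Ratio of observed brain volume to the volume expected for that body mass."""
    return brain_volume / expected_brain_volume(body_mass_g, intercept, slope)


# Hypothetical lineage: body mass grows 10-fold while brain volume only triples.
ancestor = encephalization_quotient(brain_volume=2.0, body_mass_g=100.0)
descendant = encephalization_quotient(brain_volume=6.0, body_mass_g=1000.0)

print(f"ancestor EQ:   {ancestor:.2f}")
print(f"descendant EQ: {descendant:.2f}")
```

Under these assumed numbers the descendant has a larger brain in absolute terms but a lower quotient, because the brain grew more slowly than the allometric expectation for its new body size. That is the "larger then smarter" pattern the question above contrasts with "smarter then larger".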

Across contemporary mammals, there 
is a highly nonlinear scaling of body and 
brain size, with a strong phylogenetic 
signal (2, 4). Slopes and intercepts vary 
substantially among mammalian orders, 
with groups such as carnivores exhibiting 
steeper slopes than those of ungulates (2, 
4). Recent work demonstrates that the al- 
lometric scaling of brain to body size ap- 
pears to have been repatterned multiple 
times during the Cenozoic, especially near 
the Cretaceous-Paleogene and Paleogene- 
Neogene boundaries (~23 million years 
ago), when changing environmental con- 
ditions may have precipitated evolution- 
ary innovations and ecological radiations 
(4). However, it is difficult to determine 
how and when mammals encephalized to 
near-modern levels because well-preserved 
skulls dating to the Mesozoic and early 
Cenozoic are fairly rare. Moreover, inves- 
tigating brain development is complicated 
because relative or absolute brain size may 
not be the best measure of cognition (4). 
Ideally, studies should examine brain com- 
plexity and the relative proportions of differ- 
ent brain regions—the volume of brain de- 
voted to higher cognition versus mediating 
basic autonomic and sensory functions (4). 

Bertrand et al. combined previously 
reported endocranial data with high-res- 
olution computed tomography (CT) im- 
ages of newly discovered early Cenozoic 
mammal skulls. Not only did they measure 
brain size, they also characterized the indi- 
vidual sensory components for many fos- 
sils. These included the neocortex, which 
is related to higher-order brain functions 
such as sensory perception and cogni- 
tion; the olfactory bulbs, which are related 
to smell; and the cerebellar petrosal lob- 
ules, which are related to eye movement. 
The authors examined 124 extinct spe- 
cies dating from early in mammal evolu- 
tion to the middle Cenozoic. Their study 
was particularly data-rich for the crucial 
period immediately after the Cretaceous- 
Paleogene extinction, which marked the 
onset of rapid morphological and ecologi- 
cal diversification in terrestrial mammals 
and an expansion in mass of more than 
four orders of magnitude (10). They found 
that immediately after that extinction, 
mammals were actually less encephalized 
than earlier, or later, in their evolution- 
ary history. Mammals expanded in body 
size much faster than brain size as they 
diversified after the end-Cretaceous ex- 
tinction—favoring “brawn” before “brain” 
(8). The reduction in PEQ was relatively 
short-lived, however. By the early Eocene 
(~50 million years ago), relative brain size 
had substantially expanded in all clades, 
although more so in crown orders (earli- 
est members of extant placental orders) 
than in stem taxa (archaic placentals). The 
rate of encephalization slowed by 40 mil- 
lion years ago, about when morphological 
diversification plateaued (10). 

Notably, changes in relative brain vol- 
ume over the Cenozoic were not uniform 
across the different sensory regions; larger 
brains were not simply scaled-up versions 
of smaller brains (4). Rather, there was a 
large increase in the neocortex, whereas 
the proportion of the brain devoted to ol- 
faction decreased. And after an initial de- 
crease in the relative size of the petrosal 
lobules after the Cretaceous-Paleogene, 
they too increased over time. These 
changes in brain composition were more 
marked in crown orders than in stem taxa, 
supporting the idea that differences con- 
tributed to the eventual decline and extinc- 
tion of the latter. Overall, encephalization 
after the initial drop in the Paleocene was 
driven by the expansion of brain regions 
mediating balance, vision and eye movement, 
head control, and sensory integration, 
rather than olfaction. 

Brains are energetically expensive, us- 
ing almost an order of magnitude more 
energy per gram than that of other body 
tissues (12). Evolving a larger brain rela- 
tive to body size can necessitate higher 
basal metabolic rates and/or reduced al- 
location to other essential physiological 
processes (12-14), although trade-offs re- 
main unclear (2, 4). Given this cost, why 
have mammals undergone such remark- 
able encephalization over the Cenozoic? 
Although numerous hypotheses have been 
proposed, including selection related to 
sociality, improved hearing, feeding, taste, 
olfaction, miniaturization, parental care, 
endothermy, and nocturnality [for example, 
(6, 8)], the ecological niche is likely 
important (2, 4, 15). Bertrand et al. posit 
that as mammals evolved larger sizes and 
ecosystems began saturating, interspecific 
competition intensified selection for larger 
relative brains. For example, predators de- 
veloped substantially larger PEQ than that 
of prey groups by the Eocene. Earlier work 
reported a transition between the growth 
and saturating phase of body size evolution 
at this time, which is consistent with the 
saturation of ecological niches (10). 

Understanding the trade-offs between the 
high energetic demands of brain tissue and 
the potential value of increased brain size 
for survival and reproductive success re- 
mains a challenge. Studies such as Bertrand 
et al. highlight the utility of the fossil record 
for disentangling fundamental allometric 
scaling among contemporary mammals, as 
well as highlighting the central role of ecol- 
ogy in mammal macroevolution. 
Perovskite 
solar cells can 
take the heat 


Layered surface structure 
adds durability to packaged 
perovskite cells 


By Joseph M. Luther and Laura T. Schelhas 


Solar panels must endure extreme conditions 
during their service lifetime. 
There are few other products, let alone 
pristine semiconductor electronics, 
that are manufactured today and expected 
to withstand decades of abuse 
from full exposure to outdoor elements, 
including harsh ultraviolet sunlight, rain 
and hail, and daily and seasonal thermal 
cycles. Over the past decade, a new class 
of semiconductors, known as metal halide 
perovskites, has emerged, which constitutes 
the light-absorbing material in solar panels 
and has made incredible advancements, 
potentially taking photovoltaics in new di- 
rections. Perovskite materials have been 
touted as a potential game changer for solar 
energy production (1). On page 73 of this issue, 
Azmi et al. (2) present a reproducible 
layered surface structure of the perovskite 
for improved durability. The resulting 
perovskite cells can maintain performance 
under high heat and humidity. 

Perovskite solar panels promise an effi- 
cient, low-cost, and simple-to-manufacture 
solution that is on the cusp of commercializa- 
tion, as either a stand-alone technology or an 
add-on to silicon in a tandem configuration. 
However, naysayers of perovskite’s future po- 
tential often point to the lack of studies dem- 
onstrating durability in packaged cells and 
modules. In a commercial product, arrays 
of solar cells are packaged to create a solar 
module. Packaging is used to contain the cells 
and can protect them from the environment. 
Commercial solar panels are often subjected 
to accelerated stress testing to quickly assess 
their durability in the field. When developing 
new technologies, researchers cannot sim- 
ply place panels outside and wait multiple 
decades to assess stability; rather, they are 
forced to develop accelerated test procedures 
and protocols (3, 4). 
One common testing protocol is the 
International Electrotechnical Commission’s 
61215 series of standardized tests (5). This test 
series is a pass-fail protocol that virtually ev- 
ery commercial solar module must pass. One 
of the tests in this protocol is the so-called 
damp heat test, which places the module in 
a chamber set to 85°C (185°F) with 85% rela- 
tive humidity for 1000 hours. It is designed to 
expose any weakness of the product against 
the combined stresses of heat and moisture. 
Perovskite solar materials have relatively low 
formation energy—in other words, it takes 
little energy to form or break atomic bonds 
within the material. Although this is benefi- 
cial for the fabrication of perovskites, it also 
makes them prone to degradation in high 
heat. Additionally, perovskites can be sensi- 
tive to moisture, further making this test one 
of the most difficult for perovskite technology 
to pass. A conventional way to survive the 
damp heat test is to physically limit the in- 
gress of water into the working components 
of the panel, thus decoupling and eliminating 
the humidity portion of the stress. It has been 
demonstrated that a glass sandwich, with the 
solar cell material in the middle and an edge 
seal containing desiccant, can prevent water 
from reaching the absorber and metal con- 
tacts (6, 7). Azmi et al. used this approach 
on highly efficient solar cells but found that 
they still degraded after damp heat exposure. 
Although their glass sandwich barrier can 


keep the moisture out, the device is still sub- 
jected to extreme thermal stress, suggesting 
that the perovskite was degrading from high 
heat and not moisture. To put this in context, 
the hottest air temperature ever recorded on 
earth is 58°C, and the solar panel itself may 
regularly reach temperatures of 65°C (8). The 
85°C temperature during the damp heat test 
is beyond the maximum temperature experi- 
enced, but the benchmark has proved to be 
a good substitute for assessing temperature- 
induced degradation processes. 
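
One common way to reason about why a hotter test can stand in for years of milder exposure is an Arrhenius acceleration factor. The sketch below is illustrative only. The article does not say which degradation model applies to these cells, and the activation energy `EA_EV` is a hypothetical placeholder rather than a measured value for perovskites. It compares the 85°C chamber temperature with the 65°C panel temperature mentioned above.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333e-5  # Boltzmann constant in eV/K
EA_EV = 0.7  # hypothetical activation energy (eV); not taken from the article


def celsius_to_kelvin(temp_c):
    return temp_c + 273.15


def arrhenius_acceleration(use_temp_c, test_temp_c, activation_energy_ev=EA_EV):
    """Factor by which a thermally activated process speeds up at the test temperature."""
    t_use = celsius_to_kelvin(use_temp_c)
    t_test = celsius_to_kelvin(test_temp_c)
    exponent = (activation_energy_ev / BOLTZMANN_EV_PER_K) * (1.0 / t_use - 1.0 / t_test)
    return math.exp(exponent)


factor = arrhenius_acceleration(use_temp_c=65.0, test_temp_c=85.0)
print(f"Acceleration factor at 85°C relative to 65°C: {factor:.1f}")
```

The factor depends strongly on the assumed activation energy, which is one reason accelerated tests are treated as a screening benchmark and not as a direct lifetime prediction.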

Azmi et al. improved the thermal stability 
of perovskite cells by treating the films with 
an oleylammonium iodide [(C18H35NH3)I] 
solution and modifying the thermal process- 
ing during fabrication to alter the perovskite 
crystals’ surface structure (see the figure). 
The authors show that this treatment creates 
a layered surface structure, causing passiv- 
ation on each microscopic perovskite grain. 
Surfaces are often the most problematic por- 
tion of semiconductors, and passivation re- 
fers to terminal binding species that remove 
what could become electronic defects. This 
transformation has been shown to improve 
various aspects of performance and stability 
(9, 10). Layered perovskite-inspired materials 
are at the forefront of materials science re- 
search and are being considered for various 
applications, including solar cells, light-emit- 
ting diodes (LEDs), and other devices (11, 12). 
Microscopically speaking, layered perovskites 
are composed of alternating lead halide octahedra 
and self-assembled molecules that are 
typically too large to allow the formation of 
a three-dimensional perovskite lattice. These 
layers can form various crystal structures depending 
on how the octahedra of one layer 
align with those of the adjacent layer. For 
example, the added oleylammonium cations 
from the treatment can be controlled to orient 
between each lead halide layer or less frequently. 
Azmi et al. observed that films processed 
without annealing during fabrication 
created two octahedral units per layer adjacent 
to the perovskite rather than one, which 
is shown to be a superior barrier. Another 
major win for this approach is the reproducibility 
of the process. Consistent results were 
repeatedly obtained by many device makers 
within the research team. 

Perovskite solar cell packaging 

The oleylammonium iodide surface treatment creates a thin layer of 
two-dimensional (2D) perovskite, which aids in heat tolerance. The 
treated perovskite material can then be used to build the photovoltaic 
device, where glass and an edge seal protect the stack from humidity. 

The 24% efficient perovskite solar cells 
that are stable under damp heat tests dem- 
onstrate a step in the right direction for 
perovskite solar panels. Thoughtful selec- 
tion of the package can prevent some deg- 
radation pathways, and informed materials 
engineering can open routes to improving 
the inherent thermal stability of perovskite 
solar materials. Despite all the excite- 
ment with perovskites, accelerated testing 
protocols such as this are no replacement 
for long-duration field studies that exist- 
ing technologies have completed over the 
years. There is a strong push to begin col- 
lecting outdoor performance and durability 
data now. As the community pushes this 
technology forward, more will be known 
regarding whether perovskites can contrib- 
ute as much to solar energy as anticipated. 
Continued dedication to these approaches 
can help enable commercialization. 
Eco-evolutionary effects 
of keystone genes 


The rapid evolution of specific genes within 
species can drive ecological changes 


By Patrik Nosil and Zach Gompert 


There is increasing evidence that genetic 
evolution can occur rapidly 
enough to affect the ecological dynamics 
of populations and communities 
(1-3). To better predict the 
future of ecosystems, it is necessary 
to understand how evolutionary changes 
within species influence and interact with 
ecological changes through processes 
known as “eco-evolutionary dynamics” (4). 
On page 70 of this issue, Barbour et al. (5) 
demonstrate that a gene affecting a plant’s 
resistance to herbivory also influences 
the persistence of the food web through 
the gene’s effect on plant growth (see the 
figure). Subsequent studies of natural se- 
lection in the wild can help explain how 
variations of such “keystone genes” can 
be maintained (6, 7). The maintenance of 
genetic variation in keystone genes is re- 
quired for eco-evolutionary dynamics to be 
perpetual rather than transient. 

It is important to determine the genes 
that underlie eco-evolutionary dynamics 
because genetic details, such as the num- 
ber and biological effect of genes that af- 
fect traits, are expected to influence evolu- 
tion. In this context, by identifying a gene 
with marked ecological effects, the study 
by Barbour et al. offers several key in- 
sights. The authors constructed an experi- 
mental food web in the laboratory, which 
comprised a predator (a parasitoid wasp 
that attacks aphids), two herbivores (two 
species of aphids that eat plants), and the 
plant Arabidopsis thaliana—also known 
as thale cress, a model genetic organism. 
They studied three genes in the cress that 
showed variation in the wild. The genes 
were chosen because of their role in con- 
trolling chemical biosynthesis that could 
affect resistance to herbivory. 

Barbour et al. observed that one of the 
genes, AOP2, affects the persistence of the 
food web. That is, this gene affects extinc- 
tion risk and the collapse of the wider 
community in the experimental ecosystem. 
Specifically, the null allele (an allele is a 
variant of a gene), which is a loss-of-func- 
tion mutation that occurs in natural thale 
cress populations, reduced the extinc- 
tion risk for the species in the food web. 
Critically, these effects are pronounced, 
with the null allele reducing extinction 
risk by 16% relative to the average allele, 
and by 29% relative to the AOP2+ allele, 
a specific AOP2 genetic variant. Just as 
keystone species have disproportionately 
strong effects in communities relative 
to other species (8), AOP2 is a keystone 
gene that has strong effects on ecological 
dynamics (9, 10). The authors combined 
their results with ecological modeling 
to elucidate the mechanisms underlying 
the observed effects of AOP2 
on food web dynamics. Their analysis 
revealed that the gene affects extinction 
by altering species’ intrinsic growth rates 
in a manner that allows the parasitoid 
and the dominant aphid species to coexist. 
Thus, the results also inform the mecha- 
nisms by which species coexist—a major 
theme in ecology (11). 

Given the revelation brought by these 
results, perhaps the most pressing ques- 
tion that needs answering is whether the 
variation at keystone genes is maintained 
over the long term, and if so, what are the 
mechanisms that preserve this variation 
over time (6, 7)? These are important ques- 
tions because genetic variation is the fuel 
for evolution. If the variation is lost, eco- 
evolutionary dynamics will not occur ex- 
cept when following variation introduced 
by the occasional mutation or gene flow. 
In classic population-genetic models of ad- 
aptation, genetic variation is expected to 
be lost as natural selection acts in a single, 
consistent direction that favors beneficial 
alleles in the population (6). Through this 
process, natural selection causes one allele 
to replace all others, eliminating genetic 
variation from the population. 

Barbour et al. demonstrate in a laboratory setting 
that a gene affecting the herbivory resistance of a 
plant (pictured here as the thale cress) also influences 
extinction risks of other organisms in the community. 

In contrast to this view of directional 
selection and the loss of variation, the 
ecological genetics literature emphasizes 
“balancing selection,” where heterozygotes 
(i.e., individuals with two different allele 
copies at a gene) have higher fitness than 
homozygotes or where selection fluctuates 
over time and is dependent on the environmental 
or genetic context (7). In such 
instances, the selection is not uniform, 
and genetic variation can be maintained. 
Determining how selection acts on keystone 
genes in the wild is thus required to 
predict whether eco-evolutionary dynamics 
will be brief and transient (as predicted 
by directional selection), or more pervasive 
and perpetual (as predicted by balancing 
selection) (1-3). 
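
The contrast between directional and balancing selection can be made concrete with the standard one-locus, two-allele selection recursion. The sketch below is a textbook illustration and not the model used by Barbour et al.; the fitness values are hypothetical, and the population is treated as infinitely large (no drift, mutation, or gene flow).

```python
def next_frequency(p, w_aa, w_ab, w_bb):
    """One generation of selection at a locus with alleles A (frequency p) and B."""
    q = 1.0 - p
    mean_fitness = p * p * w_aa + 2.0 * p * q * w_ab + q * q * w_bb
    return p * (p * w_aa + q * w_ab) / mean_fitness


def iterate(p0, generations, w_aa, w_ab, w_bb):
    """Apply the selection recursion repeatedly, starting from frequency p0."""
    p = p0
    for _ in range(generations):
        p = next_frequency(p, w_aa, w_ab, w_bb)
    return p


# Directional selection: allele A is always favored (hypothetical fitnesses).
directional = iterate(0.1, 2000, w_aa=1.0, w_ab=0.95, w_bb=0.9)

# Balancing selection via heterozygote advantage (hypothetical fitnesses).
balancing = iterate(0.1, 2000, w_aa=0.9, w_ab=1.0, w_bb=0.8)

print(f"directional: frequency of A = {directional:.4f}")
print(f"balancing:   frequency of A = {balancing:.4f}")
```

Under the directional scenario allele A approaches fixation and the variation is lost. With heterozygote advantage the frequency settles at an intermediate equilibrium, here (w_ab - w_bb) / (2 w_ab - w_aa - w_bb) = 2/3, so both alleles persist.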

Another future direction concerns the 
genetic architecture of traits that exhibit 
ecological effects (4, 12, 13). Although 
traits controlled by single genes of large 
effect certainly exist, such as the Agouti 
gene affecting coat color in mice (12), 
many traits exhibit continuous phenotypic 
variation underlain by many genes (14). In 
such cases, individual genes have minor effects 
on trait variation. Moreover, not only 
genes but also the environment influences 
trait expression. 

PHOTO: NIALL BENVIE/MINDEN PICTURES 

Additionally, even apparent cases of sin- 
gle genes with strong effects on a trait may 
represent multiple genes that are tightly 
linked (i.e., physically located) on the same 
chromosome, as occurs in “supergenes” 
(13). Studying such complex genetic archi- 
tectures is more challenging than studying 
single genes, as evidenced by difficulties 
in genetic mapping of human disease and 
complex behavioral traits (14). Testing how 
traits underlain by many genes affect eco- 
logical dynamics is a challenging, yet im- 
portant, avenue for future work. A caveat is 
that even if major effect loci are relatively 
rare, they could be more likely than mi- 
nor effect loci to exert marked ecological 
effects (1). 

Further studies that combine disci- 
plines such as ecology, genetics, and math- 
ematical modeling are likely to invigorate 
the field of eco-evolutionary dynamics. 
Although simple systems are a powerful 
and useful starting point for such work, 
most eco-evolutionary systems are more 
complex because of the interactions and 
feedback among and within ecological and 
evolutionary processes, and complex communities 
and trait genetics (1-3). This complexity 
of eco-evolutionary systems must 
be unraveled to elucidate if these dynam- 
ics will be gradual or abrupt, and how the 
dynamics can be characterized by tipping 
points in ecosystems (15). Such knowledge 
will inform the importance of evolution for 
ecological dynamics and biodiversity. 
EVOLUTION AND COGNITION 


Selective cultural processes 
generate adaptive heuristics 


Less intuitive, hard-to-learn cognitive heuristics can thrive 


By Joseph Henrich 


Long before the arrival of astrolabes, 
compasses, or marine chronometers, 
Micronesian navigators guided canoes 
across hundreds of miles of 
open ocean by integrating information 
from stars, waves, coral reefs 
and more into a rich conceptual frame- 
work. Looking into the heavens, they saw 
a celestial compass and knew which rising 
and setting stars to follow. They used three 
distinct types of waves, originating from 
the south, east, and northeast, as directional 
references (1, 2). Equipped with such 
cognitive tools, the ancestors of these Mi- 
cronesians spread across Oceania, sailing 
from the bustling islands of Southeast Asia 
to the remote reaches of Hawaii, Easter 
Island, and South America. How do such 
complex cognitive technologies emerge? 
On page 95 of this issue, Thompson et al. 
(3) describe taking an experimental ap- 
proach to the question of how opportu- 
nities to selectively learn from successful 
role models can favor the spread of more 
adaptive, but less intuitive, cognitive heu- 
ristics over more intuitive and memorable 
alternatives. 

To hunt, gather, farm, fish, and tackle 
countless other challenges, humans have 
relied on a dizzying array of complex, lo- 
cally adaptive heuristics (4, 5). Yet the 
origins of such heuristics present a puzzle 
because the best protocols and practices 
are often not the easiest to learn, remem- 
ber, and teach. Even when one manages to 
master a more effective, but less intuitive, 
technique, it will deteriorate as it’s trans- 
mitted and transformed by the minds of 
others over generations (6). Researchers 
have long considered how ensembles of 
psychological biases, preferences, and in- 
ferences make certain ideas, stories, songs, 
and concepts catchier—easier to acquire, 
recall, and retransmit—and have deployed 
these “cognitive attractors” to account for 
the recurrent patterns found across societ- 
ies in domains such as religion, literature, 
art, and folk biology (7-10). But given the 
pervasiveness of such cognitive attractors, 
how can the impressive assemblages of 
nonintuitive heuristics and hard-to-learn 
cognitive abilities that have permitted hu- 
mans to dominate Earth’s major ecosys- 
tems be accounted for? 

Approaching this puzzle, cultural evo- 
lutionary theorists were inspired by eth- 
nographic accounts such as those for 
Micronesia, where all young males aspired 
to become master navigators—the pinnacle 
of local prestige—by apprenticing under 
the most respected masters (1). Few aspirants, 
however, succeeded in being initiated 
as masters because some of the most 
important skills were nonintuitive, diffi- 
cult to learn, and unforgiving of medioc- 
rity—miscalculating by a few degrees could 
mean sailing past an atoll and dehydrating 
or starving, lost in the Pacific’s vastness. 
Using mathematical models, theorists have 
shown that if learners selectively attend to 
models or teachers based on cues of pres- 
tige, success, and skill, this can drive the 
spread of less intuitive heuristics by com- 
pensating for errors that creep in when 
complex heuristics are transmitted from 
person to person (6). 

To assess the role of selective cultural 
learning in creating adaptive heuristics, 
Thompson et al. paid experimental par- 
ticipants according to how efficiently they 
could complete a sorting task. Participants 
had to correctly order six tiles using the 
fewest number of paired comparisons and 
were told that the tiles themselves were 
uninformative—only the paired compari- 
sons revealed ordering information. Par- 
ticipants had to sort nine different sextets 
of tiles. After completing their assign- 
ment, they each wrote up what they had 
learned for the next generation of sorters 
and passed it along with a demonstration 
of their sorting strategy. Sorters were fur- 
ther paid according to the success of those 
who tapped them as “teachers.” After the 
first generation, which served as an asocial 
treatment in which sorters had no oppor- 
tunity to learn from others, participants in 
generations 2 to 12 were randomly placed 
into either the selective social learning or 
random mixing treatments. In both treat- 
ments, naive sorters could select up to 
three teachers from a set of eight participants
from the prior generation. The only
difference was that sorters in the selective
social learning treatment could observe
the monetary payoffs (“success”) of their
eight potential teachers, whereas those in
the random mixing treatment observed
only their teachers’ arbitrary identification
numbers. When learners selected a particular
model, they had an opportunity to
read their teacher’s write-up, observe their
demonstration, and practice their strategy.

[Figure: An illustration from the early 1800s depicts canoes near the Mariana Islands in Micronesia.]

When individuals could preferentially 
learn from more successful sorters, their 
average performances increased across 
time such that, by the final generations, 
their scores exceeded those in the random 
mixing and asocial treatments by 24 and 
70%, respectively. By the end, the randomly 
chosen participants who could selectively 
learn from others seemed smarter: They 
had better cognitive algorithms for solving 
this kind of problem. This not only confirms 
prior work (11, 12), illustrating how selective
cultural learning can foster the evolution of 
more efficient artifacts (e.g., kayaks and 
bows), but also shows that it can generate 
better information-processing algorithms. 
Closer inspections of the data reveal 
even more notable results. Although there 
are a vast number of possible ways to go 
about this sorting process, only eight distinct
and identifiable algorithms emerged
in the social treatments, and only two—se- 
lection sort and gnome sort—dominated 
the final three generations. Meanwhile, in 
the asocial treatment, most sorters used id- 
iosyncratic approaches that didn’t approxi- 
mate any known algorithm and generally 
performed poorly. Thus, when social learn- 
ing was possible, a method’s transmissi- 
bility, fidelity, and memorability rapidly 
shaped the variation in sorting algorithms, 
creating quasi-discrete cultural units for 
selective learning to act on (6). 
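
For readers unfamiliar with the two algorithms, the sketch below counts paired comparisons for textbook versions of each on six tiles. It makes several assumptions: the function names are illustrative, gnome sort is written in its common optimized form (after walking a tile back into place it resumes from where it left off, which makes it behave like insertion sort), and the study's actual task mechanics and participants' procedures may have differed. The counts are therefore not expected to reproduce the reported results.

```python
from itertools import permutations

def selection_sort(tiles):
    """Return (sorted copy, comparisons) using textbook selection sort."""
    a, comparisons = list(tiles), 0
    for i in range(len(a) - 1):
        smallest = i
        for j in range(i + 1, len(a)):
            comparisons += 1  # one paired comparison
            if a[j] < a[smallest]:
                smallest = j
        a[i], a[smallest] = a[smallest], a[i]
    return a, comparisons

def gnome_sort(tiles):
    """Return (sorted copy, comparisons) using optimized gnome sort."""
    a, comparisons = list(tiles), 0
    for upper in range(1, len(a)):
        pos = upper
        # Walk the new tile back until its left neighbor is not larger.
        while pos > 0:
            comparisons += 1  # one paired comparison
            if a[pos - 1] <= a[pos]:
                break
            a[pos - 1], a[pos] = a[pos], a[pos - 1]
            pos -= 1
    return a, comparisons

def average_comparisons(sort, n=6):
    """Average comparison count over every ordering of n distinct tiles."""
    counts = [sort(p)[1] for p in permutations(range(n))]
    return sum(counts) / len(counts)

print("selection sort:", average_comparisons(selection_sort))
print("gnome sort:    ", average_comparisons(gnome_sort))
```

Selection sort always scans the whole unsorted remainder, so its comparison count is fixed for a given number of tiles. This version of gnome sort stops walking back as soon as a tile is in place, so it can finish early on partly ordered input but requires the sorter to keep track of more state.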

Of the two leading algorithms, selection 
sort was substantially easier to learn and/ 
or transmit—and thus represents a stron- 
ger cognitive attractor—but was 30% less 
efficient than gnome sort at the sorting 
task. The difficulty of transmitting and 
learning the gnome algorithm is seen by 
the more complex written instructions 
provided by gnome teachers and the more 
frequent errors made by those trying to 
imitate gnome during their practice round. 
Consequently, in the random mixing treat- 
ment, when gnome did sometimes emerge 
and begin to spread, errors introduced 
during the transmission process degraded 
its effectiveness, allowing selection sort to
outcompete it. By contrast, when selective
learning was possible, gnome not only
survived but often diffused across the pop- 
ulation and was sustained at sizable fre- 
quencies. Here, selective processes favored 
the more efficient, but harder to learn, 
cognitive attractor by allowing learners to 
pick models who hadn’t inadvertently in- 
troduced errors, thus filtering out errors 
each generation. Indeed, only in the selec- 
tive social learning treatment did a stable 
cadre of gnome “master sorters” emerge. 
As with Micronesian navigators, nearly all 
master sorters (>85%) had master sorters 
as teachers, though most students of mas- 
ter sorters did not become master sorters 
themselves. This experimental evidence 
elegantly converges with well-established 
ethnographic patterns (13, 14). 

These results highlight a deeper point: 
Humans don’t have culture because we’re
smart; we’re smart because we have culture
(3). The selective processes of cultural
evolution not only generate more sophisti- 
cated practices and technologies but also 
produce new cognitive tools—algorithms— 
that make humans better adapted to the 
ecological and institutional challenges that 
we confront. Thompson et al.’s results underline
the need for the psychological sciences
to abandon their implicit reliance on
a digital computer metaphor of the mind 
(hardware versus software) and transform 
into a historical science that considers not 
just how cultural evolution shapes what we 
think (our mental contents) but also how 
we think [our cognitive processes (15)]. 
Pain-resolving microglia 


An emergent subgroup of spinal cord microglia mediates recovery from persistent pain 


By George Sideris-Lampretsas 
and Marzia Malcangio 


Neuropathic pain after peripheral
nerve damage is a lasting condition
that generally persists even when the
cause of damage disappears (1). The
immune system is integral to the development
of neuropathic pain: In the
spinal cord, microglia—the central nervous 
system-resident macrophages—respond to 
neuronal activity and set up a positive feed- 
back loop with neurons that promotes pain 
onset. Thus, disruption of microglia-neuron 
communication is being considered as a 
strategy to produce analgesia. On page 86 of 
this issue, Kohno et al. (2) describe a spinal 
cord-resident pool of microglia that emerges 
during pain maintenance and contributes to 
the resolution of neuropathic pain in mice. 
Microglial heterogeneity is a well-accepted 
concept that is underexplored in the context 
of chronic pain (3). The finding that spinal 
cord microglia acquire spatial and temporal 
transcriptional heterogeneity that affects 
pain could identify new therapeutic strate- 
gies to relieve pain. 

In response to peripheral nerve trauma, the 
nociceptive system is activated. Nociception is 
the response of the nervous system, including 
the spinal cord, to noxious stimulation and 
tissue damage. Painful stimuli are detected 
by peripheral nerve fibers and transmitted 
to their central terminal in the dorsal horn 
of the spinal cord. There, microglia promptly 
proliferate and up-regulate the expression 
of various genes that are implicated in pro- 
nociceptive pathways (4). Microglia play a 
pronociceptive role at the onset of pain, in a 
sex-dependent manner: The involvement of 
microglial signaling is reliable in male mice, 
whereas some pathways appear not to be 
effective in female mice (5). However, these 
cells are considered less relevant during the 
maintenance of neuropathic pain (4). 

By contrast, the study of Kohno et al. 
shows that in the maintenance phase, an 
emergent CD11c+ microglial pool engages
in clearance of myelin. The prevention of
CD11c+ microglia emergence is sufficient for
pain to be maintained in a mouse model of 
neuropathic pain, indicating that they are 
essential for pain remission. These microglia 
develop an antinociceptive function through 
the release of anti-inflammatory mediators, 
such as insulin-like growth factor 1 (IGF-1). 
Thus, under neuropathic conditions, microg- 
lia can serve diverse roles in a temporal man- 
ner alongside pain development and remit- 
tance. Some microglia initially respond to 
neuronal activation and contribute to pain 
onset; then, with time, this distinct pool of 
microglia appears, expands, and eventually 
establishes a new microglia-to-neuron com- 
munication that is associated with pain reso- 
lution (see the figure). 


[Figure: Microglia subsets during neuropathic pain]
Kohno et al. (2) identify a cluster of CD11c+ microglia that emerge during
pain maintenance and promote remittance of neuropathic pain after ...
Dorsal horn (spinal cord):
After peripheral nerve injury, microglial activation in the dorsal horn
engages pronociceptive communication with neurons by releasing cytokines
[such as interleukin-1β (IL-1β)] that facilitate neuronal activity.
During pain maintenance, a cluster of CD11c+ microglia emerge, express the
receptor AXL, and engulf myelin debris from degenerating processes that
affect injured neurons.
These phagocytic CD11c+ microglia release insulin-like growth factor 1
(IGF-1), which exerts antinociceptive action through activation of neuronal
IGF-1 receptor (IGF1R).

What is the local cue that triggers antinociceptive
microglia emergence during neuropathic
pain? In brain development and disease,
CD11c+ microglia appear in response to
disturbance of homeostasis, and they check
neuronal health by increasing phagocytic potentials
(6). Kohno et al. found that at 2 weeks
after nerve injury in mice, CD11c+ microglia
express the receptor tyrosine kinase AXL,
which alongside the receptor tyrosine kinase
MER detects and engulfs amyloid-β aggregates
in the brains of people with Alzheimer’s
disease (7). Kohno et al. observe the CD11c+
microglial population engulfing myelin basic
protein, which suggests that they are devoted
to taking up debris resulting from loss of
myelin integrity in some fiber terminals. At
this time, CD11c+ microglia are not yet antinociceptive.
Loss of myelin integrity may be
the result of degenerative processes affecting
injured neurons that terminate in the spinal
cord. Alternatively, loss of myelin integrity
may be mediated by increased neuronal excitation
in response to injury; hence, microglial
modulation of myelin integrity may indirectly
control neural circuit functionality (8).

By 5 weeks after nerve injury, when pain
has fully resolved in the mouse model,
CD11c+ microglia express IGF-1, which
confers a sex-independent antinociceptive
effect to microglia in these neuropathic
conditions. Thus, throughout the pain-maintenance
phase, CD11c+ phagocytic microglia
gradually become antinociceptive
and eventually contribute to the resolution
of nociceptive hypersensitivity through release
of IGF-1. Because the receptor for IGF-1
is expressed by neurons in the dorsal horn
but also by oligodendrocytes, astrocytes, endothelial
cells, and microglia, there remains
scope for further studies on cellular targets
of CD11c+ microglia-derived IGF-1.

Microglia can change function in response 
to regional cues through fast and effective 
regulation of gene expression. For example, 
transcriptomics changes in neurodegenerative
conditions show that subsets of microglia
shift from a dynamic homeostatic profile
and acquire a specific disease-associated mi- 
croglia (DAM) transcriptional signature (9). 
In pain remission, CD11c+ microglia show
some similarities to DAMs because they differentially
express genes such as apolipoprotein
E (Apoe) and Axl. Nevertheless, CD11c+
microglia in the neuropathic spinal cord of 
mice have a distinct transcriptional profile 
from the DAM subset. This transcriptional 
profile is associated with antinociceptive 
microglial functional output. A cluster of 
microglia reminiscent of the antinociceptive 
CD11c+ microglia that express Apoe, Axl, and
Igf1 is also detected in the spinal cord of
mice 5 months after peripheral nerve injury 
(10). These results support the suggestion 
that a subset of microglia expressing a well- 
preserved network of transcripts appears 
weeks after pain onset, with Kohno et al.
finding that these microglia have antinoci- 
ceptive properties. 

One of the most valuable implications of 
this work is how this new mechanism that 
stimulates the emergence of pain-killing 
microglia can be translated into effective 
treatments. Better understanding of the 
role of IGF1 and APOE in antinociceptive 
microglia would facilitate the design of 
therapeutic paradigms that promote anal- 
gesia. Furthermore, definition of the context 
required for the emergence of antinocicep- 
tive microglia would be invaluable to deter- 
mine how to induce protective microglia in 
chronic pain conditions. Therefore, studies 
on pain remission states that are not charac- 
terized by loss of myelin integrity, and hence 
are not associated with enhanced microglial 
phagocytic activity, would illustrate whether 
microglia can then acquire an antinocicep- 
tive phenotype. It remains to be shown 
whether the constant uptake of myelin might 
eventually cause phagocytic exhaustion and 
microglia dystrophy, which will affect neuronal
health. Overall, the study of Kohno et al.
highlights that it is time to reassess the role 
of microglia in pain as a therapeutic target 
for the effective treatment of chronic pain. 
GENOMICS 


A next-generation human 
genome sequence 


A near-complete sequence outlines a path
for a more inclusive reference


By Deanna M. Church 


Twenty-one years ago, two initial versions
of a human genome sequence
were published by Celera Genomics
(1) and the Human Genome Project
(HGP) (2). These assemblies were incomplete
and replete with errors, but
despite these flaws, the value of having a hu- 
man genomic reference assembly was clear. 
The assembly produced by the HGP went on 
to be “finished” (3) and has been continually 
updated over the past decade (4). Despite the 
scientific and economic value of the refer- 
ence (5), it has many shortcomings, includ- 
ing that it is not actually finished. On page 44 
of this issue, Nurk et al. (6) provide the most 
complete reference assembly for any mam- 
mal. This new human reference is poised to 
have its own substantial impact on genome 
analysis and represents an important step to 
assembly models that represent all humans, 
which will better support personalized medi- 
cine, population genome analysis, and ge- 
nome editing. 

The current version of the human genome 
reference assembly, GRCh38.p14 (GRCh38), 
has millions of bases represented by the let- 
ter “N,” which means that the actual base re- 
siding at that location is unknown. There are 
also 169 sequences that cannot confidently 
be ordered or oriented within the assembly, 
typically owing to their repetitive nature, and 
that get carried along as a bag of sequences 
that are hard to analyze. Biologically impor- 
tant regions—such as the short arms of ac- 
rocentric chromosomes, centromeres, and 
several duplicated euchromatic regions—are 
not represented, are represented by model 
sequences, or are incorrectly represented. 
These sequences account for ~8% of the human
genome. Until recently, limitations of
sequencing technology, primarily that the
sequencers could read no more than about
1000 bases at a time, made it impossible to
correctly assemble the sequence reads of these
regions. But they have important biological 
functions and, in some cases, these regions 
are associated with human disease. 
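
To make the role of the placeholder letter concrete, the sketch below tallies unknown bases in a sequence string and reports where each run of “N” sits. The function name `summarize_unknown_bases` and the toy sequence are assumptions for illustration only; real assemblies are distributed as large files and would normally be read with a dedicated parser rather than held in a single string.

```python
import re

def summarize_unknown_bases(sequence):
    """Count 'N' placeholders and locate each run of them in a sequence."""
    seq = sequence.upper()
    # Each run is a half-open interval (start, end) in 0-based coordinates.
    runs = [(m.start(), m.end()) for m in re.finditer(r"N+", seq)]
    n_bases = sum(end - start for start, end in runs)
    return {
        "length": len(seq),
        "n_bases": n_bases,
        "n_fraction": n_bases / len(seq) if seq else 0.0,
        "n_runs": runs,
    }

# Toy sequence (assumed, not real genome data) with two gaps.
toy = "ACGTNNNNACGTTTNNACGA"
summary = summarize_unknown_bases(toy)
print(summary["n_bases"], "unknown bases in", summary["length"])
print("gap intervals:", summary["n_runs"])
```

Reporting gaps as intervals rather than as a single count keeps track of where sequence is missing, which matters when deciding whether a region of interest is actually represented.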
When the idea of sequencing the human 
genome was initiated, it was not a universally 
popular idea, nor was there a clear techno- 
logical blueprint of how the project would be 
completed (7). The initial focus was on map- 
ping-based approaches to better understand 
the human genome and to further develop 
sequencing technology. There were public 
debates on the best way to proceed, with 
some pushing for an approach that required 
no mapping, just chopping up the whole ge- 
nome and sequencing it in a “whole-genome 
shotgun” sequencing approach (8); others 
argued against this (9). The HGP opted for 
a more structured approach. This involved 
cloning genomic DNA into pieces that could 
be grown in bacteria (clones) and indexed in 
96-well plates. Clones from these libraries 
were first mapped to chromosome regions, 
then individually sequenced. The finished 
clones were assembled to create the chro- 
mosome sequences found in the assembly. 
Celera Genomics opted for the shotgun ap- 
proach, which was also used by the Telomere- 
to-Telomere (T2T) Consortium and is the 
dominant method today. The comparison of 
the two assemblies (10) led to a better understanding
of both approaches and of the human
genome sequence.

An important attribute of the human ref- 
erence assembly is that the source DNA was 
derived from multiple individuals. One of the 
first lessons in genetics is that humans in- 
herit one set of chromosomes from each par- 
ent. This leads to a duplication of each chro- 
mosome, except for the sex chromosomes in 
individuals harboring an XY configuration. 
The chromosomes in each pair are not exact
duplicates; there can be hundreds of thousands
of differences between them, including the 
presence or absence of large stretches of se- 
quence on one chromosome. The sequence 
on an individual chromosome is called a 
haplotype. One reason the bacterial clon- 
ing approach was preferred by the HGP was 
“because each clone represents a single hap- 
lotype, problems caused by the presence of 
polymorphisms are eliminated” [(9), p. 411]. 
That DNA from many donors was pooled in 
both approaches suggests that the extent of 
individual variation was not fully appreciated 
at the time. For example, when two clones 
from different haplotypes of an individual 
are adjacent in the reference assembly, this 
can create sequence representations that 
are not normally found in the population. In 
some cases, the sequence of the two haplo- 
types can be very different, leading to artifi- 
cial sequence gaps because correct overlaps 
between sequence reads cannot be obtained. 

Nurk et al., and other studies from the T2T 
Consortium, eliminated the problem of allelic 
diversity by sequencing the genome of a cell 
line derived from a complete hydatidiform 
mole (CHM). This abnormal product of con- 
ception arises when only one copy of a paren- 
tal genome is retained after fertilization. This 
is duplicated so that the cell contains two 
copies of the same parental genome, which 
is incompatible with fetal viability and leads 
to placental overgrowth, called a molar preg- 
nancy. Cell lines derived from CHM eliminate 
allelic variation so that an assembly approach 
can focus on resolving sequences that have 
more than two copies in a genome (paralo- 
gous duplication). By focusing on one cell line 
that contains a single haplotype, and thus no 
allelic differences between the chromosomes 
that can confuse assembly algorithms, many 
inaccuracies of GRCh38 could be corrected. 
The one deficit of this assembly is the lack 
of a Y chromosome, because the CHM13 cell 
line does not contain one. 

The new assembly provides access to cen- 
tromeric sequences (11), ribosomal sequences 
(6), and an improved representation of euchromatic
segmental duplications (SDs) (12).
SDs are long, often >100-kb sequences with 
>90% sequence identity that are often as- 
sociated with chromosome rearrangements, 
many of which lead to neurodevelopmental 
disorders. The assembly of these complex 
regions is only now possible because of the 
advent of accurate long-read (~20 kb) and 
accurate-enough ultra-long-read (>100 kb) 
sequencing. Having reads that can span the 
length of repetitive sequences allows for 
their correct placement and representation 
in the assembly. These sequences are com- 
plex and dynamic in the human population, 
and although there is more complexity in the 
population than is represented in the T2T- 
CHM13 assembly, having one correctly as- 
sembled human haplotype facilitates analy- 
sis of these biomedically important regions 
in other humans and nonhuman primates 
(12). However, an assembly of a single hap- 
lotype, even one as high quality as this, is in- 
sufficient for representing the full diversity 
of the human population. An exciting aspect 
of the study by Nurk et al. is that it lays the 
framework for how more high-quality as- 
semblies can be achieved, and thus more 
insight into population- and individual-level 
diversity can be obtained. 

[Figure: A more complete reference]
The new human genome assembly, T2T-CHM13 from the Telomere-to-Telomere
Consortium, includes complex and repetitive regions of chromosomes that
had not been included in previous versions of the human genome assembly
(GRCh38). Although the Y chromosome remains to be completed, this new
reference could be annotated with regulatory regions, variants, and sequence
diversity to give a fuller picture of human genomic variation.
Tracks shown for chromosome 1: GRCh38, T2T-CHM13, Sample 1, Sample 2, Sample 3.
Sample annotations: “This individual has an extra copy of a sequence that is in
the reference.” “This individual is missing a ...” “This individual has a sequence ...”

The quality and completeness of the T2T-CHM13
assembly suggest that this should
replace GRCh38, with a representation of the 
Y chromosome added. An immediate solu- 
tion would be to add the Y chromosome from 
GRCh38. Longer-term solutions involve im- 
proved methods that allow assemblies of dip- 
loid genomes, a goal of the T2T Consortium. 
Considerable challenges remain in achieving 
this transition. A reference assembly is not 
just a substrate for sequence alignment and 
variant calling; it is also the framework on 
which biological information is annotated. 
Although the T2T Consortium provided gene- 
level annotation for the new assembly, it is 
still not possible to take full advantage of the 
improvements in T2T-CHM13. Remapping 
annotations from GRCh38 to T2T-CHM13 is 
insufficient because of regions in T2T-CHM13 
that have no correlate in GRCh38. The rean- 
notation of large datasets—such as DNA vari- 
ants in gnomAD (13), which remains split be- 
tween annotating GRCh37 and GRCh38, and 
regulatory information found in ENCODE 
(14)—will be substantial endeavors but critical
for the full adoption of T2T-CHM13.

It is an exciting advance to have a human 
assembly this complete and accurate. But 
this is only a first step in developing better as- 
sembly models that allow the representation 
of any sequence found in a human genome. 
Despite the rich array of human diversity, 
the reference assembly has always been rep- 
resented as a set of single, linear sequences, 
one for each chromosome. This single linear 
representation limits our ability to analyze all 
genomes because sequences that are present 
on some chromosomes, but not in the refer- 
ence, cannot be analyzed. It is necessary to 
continue developing assembly models that 
allow representation of the rich sequence di- 
versity found in the human population (see 
the figure). Understanding this diversity will 
help dissect disease risks observed in dif- 
ferent populations. The work to date by the 
T2T Consortium is an important part of this 
vision, because it provides the blueprint by 
which routine, de novo assembly of individ- 
ual genomes will be possible. The continued 
work of the T2T Consortium, along with the 
Human Pangenome Reference Consortium 
(15), which aims to produce high-quality as- 
semblies from diverse human populations, 
will provide additional protocols for routine, 
diploid assemblies as well as the data struc- 
tures and tools needed to produce a refer- 
ence assembly that can represent all pos- 
sible sequences in a population. This type 
of representation is important not only for 
population-based analysis but also for indi- 
vidual-genome analysis. Better resolution of 
both sets of chromosomes in an individual 
will give a full picture of the genome and un- 
doubtedly help improve personalized medi- 
cine. In addition, these data models will al- 
low better representation of the programmed 
variation that is now possible using genome- 
editing tools. Assemblies that can also repre- 
sent the diversity of the human population 
will better serve science and humanity. 
RETROSPECTIVE 


Paul Farmer (1959-2022) 


Humanitarian and public health pioneer 


By Bill Gates 


In 2005, I had the opportunity to visit 
Paul Farmer, who dedicated his life to 
helping people access medical care in 
the world’s poorest countries. As Me- 
linda and I walked around the clinic he 
had helped build in Cange, a small town 
in central Haiti, it was clear how much Paul’s 
patients loved him. The facility was filled 
with people who had traveled many miles to 
receive treatment for diseases such as chol- 
era, tuberculosis, and HIV. Paul seemed to 
know the names of everyone we passed, often 
pausing to ask how they were feeling. 

At one point, we stopped to talk 
to a woman who only spoke Haitian 
Creole. Paul, who was fluent in the 
language, translated for us. The 
woman launched into what was obvi- 
ously a long and effusive story about 
him—I could hear the words “Dokté 
Paul” spoken several times. When she 
finished, I asked Paul what she had 
said. He blushed as he replied, “Just 
some obligatory praise for me.” 

Paul was never comfortable being 
the center of attention, but it is hard to 
imagine anyone being more worthy of 
praise. He was born in North Adams, 
Massachusetts, in 1959. As a teenager, 
he and his family spent a summer 
picking citrus fruit in Florida along- 
side Haitian immigrants. A few years 
later, he got to know Haitian farm 
workers in North Carolina’s tobacco 
fields. These experiences sparked an 
interest in Haiti that would determine 
the course of his life. 

After graduating from Duke
University in 1982 with a degree in medical
anthropology, Paul moved to Cange to
work in a local hospital. He was horrified 
to learn that people had to pay up front—an 
insurmountable barrier for many Haitians— 
for even the most basic supplies before re- 
ceiving treatment. Resolved to create a bet- 
ter system, he began studying at Harvard 
Medical School, splitting his time between 
Cange and Cambridge. He earned an MD 
and a PhD in anthropology in 1990. In ad- 
dition to pursuing his global public health 
work, he became a professor and later a department
chair at Harvard Medical School,
positions he held until his death. 

In 1987, Paul cofounded the incredible or- 
ganization that will be his legacy: Partners 
in Health (PIH). PIH was built on Paul’s core 
belief that everyone deserves good health 
and access to high-quality services. Today, 
the organization helps run medical clinics 
in 12 countries, providing phenomenal care 
in everything from ophthalmology to wom- 
en’s health to infectious diseases. Paul spent 
much of his later years traveling from clinic 
to clinic, working with staff to ensure their 
patients received the best treatment possible. 




A big believer in bringing care directly to 
people in their homes, Paul structured PIH 
accordingly. In addition to the medical fa- 
cilities that PIH runs—each one staffed by 
locals, providing stable jobs to the commu- 
nity—it oversees a robust network of com- 
munity health workers. Last year alone, PIH 
arranged for 2.1 million in-home health visits 
around the world. 

The Gates Foundation has been a proud 
funder of PIH for more than 16 years, and 
it has been amazing to watch the organiza- 
tion grow under Paul’s leadership. In 2014, 
Melinda and I traveled back to Haiti to see 
how Paul’s work had changed and grown 
since the devastating earthquake 4 years earlier.
PIH had recently opened a new hospital
in Mirebalais that was solar-powered and 
had all the equipment you would expect to 
see in a first-class facility, including incuba- 
tors, fully equipped operating theaters, and 
a CT scanner. It was remarkable to see Paul’s 
vision come to life: a world where your eco- 
nomic status does not limit your access to 
high-quality care. 

Although Paul’s most lasting impact was 
on the patients he loved so dearly, it is im- 
possible to ignore his broader influence on 
the global health community. One of the 
best examples is his work treating AIDS 
in Haiti. After the first antiretroviral treat- 
ments for the disease became available in 
the late 1980s, many experts assumed they 
would not be a feasible solution in poor 
countries. Paul proved them wrong by go- 
ing door to door to deliver the medicine 
and help his patients keep up with their 
treatment. His work paved the way for 
future international efforts such as 
the US President’s Emergency Plan 
for AIDS Relief (PEPFAR), which has 
saved 20 million lives and prevented 
millions of HIV infections since 2003. 

Paul wrote and published a dozen 
books throughout his life, including 
Reimagining Global Health, which 
was based on a course he cotaught at 
Harvard. He was the recipient of many 
awards, including the Berggruen 
Prize for Philosophy and Culture, 
the American Medical Association’s 
Outstanding International Physician 
Award, and a John D. and Catherine T. 
MacArthur Foundation fellowship. But 
the only time Paul actively sought the 
spotlight was when he knew he had 
an opportunity to highlight inequality. 
He was often asked to give commence- 
ment speeches, and I suspect he is the 
reason a lot of young people have en- 
tered careers in public health. He was 
one of the most inspirational people I 
have ever met. 

No one was a better or more tire- 
less advocate for the health of the poor than 
Paul. In Tracy Kidder’s book Mountains 
Beyond Mountains, Paul is quoted as say- 
ing, “There’s always somebody not getting 
treatment. I can’t stand that.” Paul died in 
Butaro, Rwanda, where he was teaching 
students at the University of Global Health 
Equity, which is run by PIH. He spent his 
last days treating patients at the nearby 
hospital and educating the next generation 
of global health leaders. His work will con- 
tinue through them and through Partners 
in Health and the many people he trained 
and inspired over the years. 
A push for inclusive data 
collection in STEM organizations 


Professional societies could better survey, and thus better 
serve, underrepresented groups 


By Nicholas P. Burnett, Alyssa M. Hernandez,
Emily E. King, Richelle L. Tanner,
Kathryn Wilsterman


Professional organizations in science, 
technology, engineering, and math- 
ematics (STEM) are well-positioned 
to improve the recruitment and re- 
tention (R&R) of underrepresented 
groups (1, 2) by providing targeted 
professional development, networking op- 
portunities, and political advocacy (3, 4). 
Tailoring these initiatives to specific under- 
represented groups can enhance their im- 
pact (5), but this is predicated on organiza- 
tions knowing their demographic make-up 
(6). Here, we report patterns in STEM or- 
ganizations’ collection and usage of demo- 
graphic data from members and conference 
attendees, based on information from 73 
professional societies representing 712,000 
constituents. In light of inconsistencies and 
limitations that we observed, we suggest sur- 
vey programs that can serve as models for in- 
clusive survey designs by organizations and, 
where possible, provide demographic infor- 
mation for benchmarking relative to the gen- 
eral population. With improved surveys, or- 
ganizations can leverage demographic data 
to prioritize and evaluate R&R efforts, and 
share effective strategies for R&R of under- 
represented groups across STEM. 

Baseline demographic data, when com- 
pared to the general population (across 
STEM or across a country), can help orga- 
nizations set and prioritize R&R goals and, 
when monitored over time, help organiza- 
tions evaluate the effectiveness of R&R ef- 
forts. Government agencies often provide 
the most relevant demographic data for a 
general population in STEM (e.g., the US
National Science Foundation) and in a coun- 
try (e.g., the US Census Bureau) because of 
their surveys’ large sample sizes and broad 
distributions. As a result, organizations may 
be compelled to use government surveys as a 
model for demographic survey design and for 
benchmarking (i.e., comparing their organi- 
zation’s demographic diversity to the general 
population). Although these agencies survey 
many categories of demographic informa- 
tion (e.g., gender identity, family status, citi- 
zenship, abilities, race and ethnicity), they do 
not necessarily collect all the demographic 
information that is considered meaningful to 
describe the STEM community—i.e., treating 
some groups as homogeneous (7) and ignor- 
ing other groups completely (8). By contrast, 
inclusive demographic surveys acknowl- 
edge the full diversity of identities that are 
meaningful among members of the STEM 
community. Thus, organizations seeking to 
describe their demographic composition are 
pressured to choose between following the 
examples of government agencies versus cre- 
ating new, inclusive surveys to recognize ad- 
ditional (and evolving) identities within the 
STEM community. 


A SURVEY OF SURVEYS 

We surveyed 164 STEM organizations (73 
responses, rate = 44.5%) between December 
2020 and July 2021 with the goal of under- 
standing what demographic data each orga- 
nization collects from its constituents (i.e., 
members and conference attendees) and how 
the data are used. See supplementary mate- 
rials (SM) for more details on the question- 
naire. Organizations were sourced from a list 
of professional societies affiliated with the 
American Association for the Advancement 
of Science (AAAS, the publisher of Science) 
(n = 156) or from social media (n = 8). The 
survey was sent to the elected leadership and 
management firms for each organization, and 
follow-up reminders were sent after 1 month. 
Although these organizations can have inter- 
national memberships of up to 40% (6), we 
focus our study on the demographic data that 


are culturally important in the US, under the 
assumption that US-affiliated organizations 
usually hold events in the US and thus the 
relevant demographic context is US based. 

The responding organizations repre- 
sented a wide range of fields: 31 life sci- 
ence organizations (157,000 constituents), 
5 mathematics organizations (93,000 con- 
stituents), 16 physical science organizations 
(207,000 constituents), 7 technology orga- 
nizations (124,000 constituents), and 14 
multidisciplinary organizations spanning 
multiple branches of STEM (131,000 con- 
stituents). A list of the responding organiza- 
tions is available in the SM. 
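
The totals quoted above can be checked against the per-field counts. The short sketch below uses only figures stated in the text; the variable and function names are illustrative.

```python
# Per-field figures reported in the text: (organizations, constituents).
fields = {
    "life science": (31, 157_000),
    "mathematics": (5, 93_000),
    "physical science": (16, 207_000),
    "technology": (7, 124_000),
    "multidisciplinary": (14, 131_000),
}


def percent(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


total_orgs = sum(n for n, _ in fields.values())
total_constituents = sum(c for _, c in fields.values())
response_rate = percent(total_orgs, 164)  # 164 organizations were contacted

print(total_orgs, total_constituents, response_rate)
```

The per-field counts sum to the 73 responding organizations and 712,000 constituents, and 73 of 164 gives the stated 44.5% response rate.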

Our survey may have selected for organiza- 
tions that are committed to, or are interested 
in, collecting demographic data. However, 
based on the AAAS-affiliated recruitment 
of the organizations and the similar sizes 
of constituencies across STEM fields, we 
conclude that the responding organizations 
are likely a reasonably representative cross- 
section of the most prominent STEM orga- 
nizations in the US. Each organization was 
asked about the demographic information 
that they collect from their constituents, the 
response rates to their surveys, and how the 
data were used. Organizations participated 
under the condition that no data would be 
associated with them directly. Hence we re- 
port aggregated findings and de-identified 
organization-level data, and do not report 
any group- or discipline-specific patterns to 
minimize potential for identification. 

Most STEM organizations (59 out of 73, 
or 80.8%) collect demographic information 
from their constituents. Commonly surveyed 
demographic categories included sexual ori- 
entation, disability status, racial and ethnic 
identity, and gender identity (see the top 
panel of the figure). The number of options 
offered for each demographic category var- 
ied among organizations, resulting in data- 
sets with different resolution and validity 
(i.e., how accurately the response options 
reflect the identities of the respondents). 
Specific response options for each demo- 
graphic category are shown (see the box) and 
provided in the SM. 

Of the organizations that provided re- 
sponse rates to their surveys (n = 22), the av- 
erage response rate was 36.1% (SD = 30.4%), 
which is close to response rates reported by 
other organizations (9). When asked the year 
of their most recent demographic survey, 29 
out of 59 organizations (49.2%) indicated ei- 
ther 2020 or 2021, 7 (11.9%) indicated a year 
from 2012 to 2019, and 15 (25.4%) indicated 
that they collect demographic information on 
a rolling basis with member registration. 

Of the organizations that collected demo- 
graphic data and shared details of their data 
usage (n = 48), 87.5% reported using demo-
graphic data for one or more purposes that
fell in the general categories of temporal 
monitoring, resource planning (e.g., for con- 
ferences), publishing reports (e.g., internal 
or external reports summarizing organiza- 
tional growth), writing grant proposals, and 
contributing to third-party research (see the 
bottom panel of the figure). Some respon- 
dents provided specific examples, which in- 
cluded using data to create statistical reports 
for an organization’s board of directors, writ- 
ing proposals and progress reports to federal 
funding agencies, and ensuring diverse rep- 
resentation on organizational service com- 
mittees and speaking panels. Thus, there are 
disparities between STEM organizations in 
the underlying design of demographics ques- 
tions, the administration of surveys, and the 
usage of the collected data. 

Our results indicate that some STEM or- 
ganizations do not recognize 
entire groups in STEM, in- 
cluding individuals in sexual 
minorities (i.e., LGBTQ+ peo- 
ple) or individuals with dis- 
abilities (see the top panel of 
the figure). This observation 
is surprising given the well- 
documented discrimination 
and underrepresentation of 
these groups in STEM (8, 
10). Furthermore, variation 
in response options on de- 
mographic questions—e.g., 
for racial and ethnic identity 
and gender identity—signals 
that only a fraction of orga- 
nizations aim to give a voice 
to distinctive identities that 
are frequently relegated to 
broader demographic clas- 
sifications. For instance, nu- 
merous Asian American and 
Pacific Islander (AAPI) eth- 
nic groups are often consoli- 
dated into a single “Asian” or 
“AAPI” grouping (7). 

In response to survey de- 
signs that ignore or obfuscate 
demographic identities, in- 
dividuals from underrepre- 
sented groups may elect not 
to respond to certain ques- 
tions or elect not to complete the survey, in- 
troducing a nonresponse bias into the data. 
As a result, the collected data do not directly 
reflect the true demographic composition of 
the organizations. Nonresponse biases may 
become even more prominent if surveys lack 
anonymity—e.g., data collected during mem- 
ber registration are likely linked to a respon- 
dent’s identity (11, 12). Cumulatively, these 
findings suggest that the bulk of professional 
organizations in STEM are not collecting de- 
mographic data that are representative of 
the true diversity within STEM, which mis- 
informs any subsequent use of the data for 
supporting or guiding organizational opera- 
tions such as R&R. 


GUIDES FOR SURVEY DESIGN 

Our data and the recommendations 
made here are based on a subsample of 
US-affiliated STEM organizations, and 
thus they are limited in part by the de- 
mographic categories that are promi- 
nent and culturally important in the US. 
Nonetheless, we believe that the principles 
outlined below are applicable to both US 
and international organizations. STEM 
organizations can look to national survey- 
ing programs with publicly available data 
to model demographic survey designs and 
provide benchmarking of organizational 


Demographic data collection and usage in STEM 

We obtained information about demographic data collection and usage from 73 STEM 
organizations, representing 712,000 members and conference attendees. Organizations 
most commonly collected race, ethnicity, and gender identity information but with different 
resolution and validity (top). Demographic data were commonly used for monitoring and 
resource planning (bottom). “Any use” refers to one or more of the individually listed uses. 


Number of options on survey (legend): 1-2, 3-4, 5-6, 7-8, 9-10, >10
diversity relative to the general population
(13), though these national programs may 
have flaws. In the US, three such programs 
are the National Science Foundation’s 
Survey of Earned Doctorates (SED), the 
US Census Bureau’s American Community 
Survey (ACS), and the Centers for Disease 
Control and Prevention’s National Health 
Interview Survey (NHIS). 
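
As an illustration of benchmarking, the sketch below compares each group's share of an organization's respondents with its share in a reference population. The function name, the group labels, and all numbers are hypothetical and are not drawn from the SED, ACS, or NHIS.

```python
def representation_ratio(org_counts, benchmark_shares):
    """Divide each group's share in the organization by its benchmark share.

    A ratio below 1 suggests the group is underrepresented relative to
    the benchmark; a ratio above 1 suggests the opposite.
    """
    total = sum(org_counts.values())
    return {
        group: (org_counts.get(group, 0) / total) / share
        for group, share in benchmark_shares.items()
    }


# Hypothetical survey counts and benchmark shares, for illustration only.
org_counts = {"group A": 50, "group B": 150}
benchmark_shares = {"group A": 0.5, "group B": 0.5}

ratios = representation_ratio(org_counts, benchmark_shares)
print(ratios)
```

A ratio of this kind is only meaningful when the organization's survey and the benchmark survey define their response options in the same way, which is one reason to model questions on the benchmark program.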

The SED targets individuals receiving 
research doctorates in the US (~55,000 per 


year), whereas the ACS targets the general 
US population (~3.5 million households per 
year). Thus, the SED and ACS can provide 
relevant benchmarking data for demo- 
graphic surveys of STEM organizations, but 
the diversity of identities recognized in their 
survey questions pales in comparison to the 
questions of the smaller NHIS (~32,000 to 
59,000 households per year, 2015 to 2020). 
For example, in questions on race, the 
NHIS and ACS each recognize seven dis- 
tinctive Asian identities, whereas the SED 
recognizes only a singular Asian group. 
For ethnicity, the NHIS recognizes seven 
distinctive Hispanic or Latinx identities, 
whereas the ACS and SED each recognize 
only four. Organizations wishing to describe 
international constituents should avoid de- 
scribing racial groups as “American” (e.g., 
“Asian American”) and consider asking for 
country of residence (6). 
However, “residence” can 
have several definitions, 
such as legal versus his-
torical, so surveys should
be explicit in their use of 
this term. 

In addition to race and 
ethnicity, the NHIS asks 
sexual orientation (SED 
and ACS do not explicitly 
do this), and the survey in- 
cludes a series of questions 
regarding cognitive, motor, 
visual, and auditory abilities 
that are more comprehen- 
sive in scope and response 
options than the abilities- 
related questions of the SED 
or ACS. Thus, for racial and 
ethnic identity, sexual orien- 
tation, and disability status, 
the NHIS is likely the most 
effective, all-in-one guide 
for question design and 
benchmarking by STEM or- 
ganizations. Unfortunately, 
the SED, ACS, and NHIS do 
not ask questions explicitly 
related to gender or trans-
gender identity. Programs
that survey these categories
do not, to our knowledge,
publicly release data that would be helpful
for benchmarking by STEM organizations, 
but their survey questions can still act as 
guides. For instance, Indiana University’s 
National Survey of Student Engagement 
provides a model for gender identity ques- 
tions (with additional, nonbinary response 
options), and a survey from the advocacy 
group, National Center for Transgender 
Equality, provides a model for transgender 
identity questions. Outside of the US, orga-
nizations can follow these same guidelines
by identifying large-scale, local survey 
programs that collect and report meaning- 
ful data from the general population and 
modifying the survey as needed to provide 
inclusive questions and options. 

STEM organizations may wish to sur- 
vey demographic categories beyond those 
discussed here, which were limited to the 
four most surveyed categories observed in 
our dataset (see the top panel of the fig- 
ure). We encourage organizational leaders 
to find inclusive guides and benchmarking 
data for additional categories using repu- 
table sources, such as the national survey 
programs described above. And just as na- 
tional surveys evolve over time, organiza- 
tional leaders should openly include new 
questions and response options to reflect 
and recognize levels of diversity that are 
important to the STEM community. 

Adding new, inclusive survey elements 
does not necessarily compromise long- 
term studies of organizational diversity 
(i.e., continuity between surveys). In many 
cases, the inclusive suggestions described 
here involve recognizing distinctive identi- 
ties from within groups historically treated 
as homogeneous. If needed, data from the 
updated (inclusive) surveys may be com- 
pared to historical values by statistically 
aggregating these newly recognized groups. 
In other cases, new identities and dimen- 
sions of diversity are recognized. Continuity 
between surveys is then less important be- 
cause historical surveys were not collecting 
information that is now considered correct 
or meaningful to the community. 
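
The aggregation step described above can be sketched as follows. The mapping, the response options, and the counts are hypothetical; an organization would substitute the options from its own surveys.

```python
from collections import Counter

# Hypothetical mapping from newly recognized response options to the
# single category used by an older survey.
LEGACY_CATEGORY = {
    "Chinese": "Asian",
    "Filipino": "Asian",
    "Asian Indian": "Asian",
    "Vietnamese": "Asian",
}


def aggregate_to_legacy(counts, mapping):
    """Collapse fine-grained counts into legacy categories.

    Options absent from the mapping are carried over unchanged.
    """
    legacy = Counter()
    for option, n in counts.items():
        legacy[mapping.get(option, option)] += n
    return dict(legacy)


# Hypothetical counts from an updated, more inclusive survey.
new_survey = {
    "Chinese": 40,
    "Filipino": 15,
    "Asian Indian": 30,
    "Vietnamese": 10,
    "White": 200,
}

legacy_view = aggregate_to_legacy(new_survey, LEGACY_CATEGORY)
print(legacy_view)
```

Because the fine-grained counts are retained and only the comparison uses the collapsed view, the organization keeps continuity with historical surveys without giving up the added resolution.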

Inclusive survey questions and response 
options alone do not guarantee represen- 
tative demographic data. Other aspects 
of surveys can prompt or prevent entire 
groups from responding, resulting in non- 
response biases. Response rates in general 
can be improved—and nonresponse bias 
reduced—by ensuring anonymity (e.g., 
delinking surveys from member registra-
tion), sending reminders, minimizing sur-
vey length, and providing incentives (11,
14). Furthermore, for sensitive questions,
response rates may also benefit from pro- 
viding a justification for data collection, 
such as that the results will influence spe- 
cific R&R initiatives. 

It is our hope that the guides presented 
here will enable more STEM organizations 
to quantify demographic diversity among 
their constituencies and use the data to in- 
form and evaluate R&R efforts. To improve 
capacity to compare the efficacy of R&R 
efforts, partnering STEM organizations 
may wish to develop compatible frame- 
works for data collection by first identify- 
ing a common survey program for survey 
Different surveys that ask
about disability status offer
different response options

Surveyed STEM organizations
provided varying sets of options
for responding to a question about
disability status on demographic
surveys. See supplementary materi-
als for response options provided for
other questions regarding racial and
ethnic identity, gender identity, and
sexual orientation.

Survey A (ten options)
Sensory impairment (vision or
hearing) • Learning disability (e.g.,
ADHD, dyslexia) • Autism spectrum
disorder • Long-term medical illness
(e.g., epilepsy, cystic fibrosis) • Speech or
language impairment • Mobility limitation
or orthopedic impairment • Mental
health disorder (e.g., Major depressive
disorder) • Temporary impairment
due to illness or injury • Disability or
impairment not listed - diagnosed or
undiagnosed • I do not identify with a
disability or impairment

Survey B (ten options)
Deaf/hard of hearing • Visual
impairment • Mobility impairment •
Cognitive or learning disability • Mental
health diagnosis • Neuroatypical •
Autoimmune or pain disorder •
Moderate to severe allergies/asthma/
environmental sensitivities • Other •
None of the above

Survey C (six options)
Difficulty seeing • Hearing • Speaking •
Walking • Lifting/carrying •
Concentrating/remembering

Survey D (five options)
I have serious difficulty walking or
climbing stairs • I am deaf or have
serious difficulty hearing • I have been
diagnosed by a health professional
as having permanent memory loss or
learning disability • I am blind or have
difficulty seeing even when wearing
glasses • None of the above

Survey E (five options)
Hearing impairment • Vision
impairment • Learning disability •
Mobility/orthopedic impairment • Other

Survey F (three options)
Yes • No • Specify if you wish

Survey G (two options)
Yes • No


design and benchmarking data, and then 
agreeing on the demographic questions 
and response options to ask constituents. 
With this collaborative approach to sur- 
vey design, organizations can then share 
successful strategies for the R&R of spe- 
cific groups, resulting in impactful and 
widespread support of underrepresented 
groups across STEM. 

Future work in this field should seek 
to understand how organizational recog- 
nition of diverse identities influences the 
R&R of underrepresented groups, and how 
this relationship varies across disciplines, 
organization types, and world regions. 
For instance, the analysis presented here 
is an aggregate snapshot of US-affiliated 
organizations across all areas of STEM. 
Do some STEM disciplines stand out in 
their approach to demographic data col- 
lection and R&R—e.g., in ways that could 
be good examples for other disciplines? 
Do for-profit organizations approach de- 
mographic diversity in ways that could 
inform R&R strategies in other organiza- 
tions? Detailed assessments in these areas 
could identify and refine key strategies for 
demographic data collection and usage, 
and the use of these surveys for assessing 
R&R more broadly. Disseminating these 
findings to all STEM organizations will fa- 
cilitate society-led support of underrepre- 
sented groups on a broad scale. 
The scientist-statesman


A new documentary capably navigates the life and legacy 
of Benjamin Franklin 


By R. B. Bernstein 


Based on a well-crafted, thoughtful
script by Dayton Duncan and ably 
narrated by the actor Peter Coyote, 
Ken Burns’s new documentary, Ben- 
jamin Franklin, features enlighten- 
ing interviews with historians and 
biographers including Walter Isaacson, 
Erica Armstrong Dunbar, Christopher 
Brown, Sheila Skemp, Stacy Schiff, Joyce 
Chaplin, Joseph Ellis, Clay Jenkinson, Gor-
don Wood, William Leuchten-
burg, and the late Bernard Bailyn.
The actor Mandy Patinkin brings 
Franklin’s words to life, with 
other well-known actors such as 
Liam Neeson and Paul Giamatti 
voicing supporting characters. An 
effective musical score drawing on 18th- 
century European composers undergirds 
the documentary, and its equally rich vi- 
suals depict 18th-century Boston, Philadel- 
phia, London, and Paris, along with other 
places linked with Franklin’s life. 
Benjamin Franklin has two parts, the 
first tracing Franklin’s life in colonial 
Boston and Philadelphia (with European 
interludes) and the second following him 
through the American Revolution, his di-
plomacy efforts in the 1770s and 1780s,
his service at the Federal Convention in 
Philadelphia in 1787, and his final years, 
concluding with his death in 1790 at the age 
of 84. This project blends its narrative of 
Franklin’s life with probing, thoughtful ex- 
planations of why his life mattered—to him, 
to his contemporaries, and to posterity. 

Part 1 begins with Benjamin’s birth in 
Boston in 1706, the 16th child and young- 
est son of Josiah Franklin, a Boston candle- 
maker, and Josiah’s second wife, Abiah 
Folger Franklin. A precocious
child, Benjamin was apprenticed 
to his older brother James as a 
printer and, early on, demon- 
strated a talent for writing and an 
irrepressible independence. 

At age 17, Franklin ran away 
from his dictatorial brother, eventually set- 
tling in Philadelphia, where he pursued 
an increasingly busy and successful career 
as a printer. With only 2 years of formal 
schooling, he became the leading printer 
in colonial America, a successful craftsman 
and man of business, and one of the lead- 
ing American writers of his time. He was 
known in particular for his perennially suc- 
cessful Poor Richard’s Almanack, which he 
published from 1732 to 1758. 

As the first distinguished American sci- 
entist, he pioneered research in electricity 
and frequently invented or adapted tech- 
nological advances, including the Franklin 


Benjamin Franklin poses in the studio of French 
sculptor Jean Antoine Houdon. 


stove, the lightning rod, and bifocal spec- 
tacles. He mapped and explained scientifi- 
cally the Gulf Stream and also wrote a phe- 
nomenally successful autobiography (1). 

Franklin broadened his public spirit to in- 
clude activism in colonial and British imperial 
politics, a part of his life bridging parts 1 and 
2 of this documentary. A committed imperial- 
ist, he grew frustrated and saddened by the 
widening rift between American colonists and 
the mother country. Burns is especially effec- 
tive in telling this part of Franklin’s story and 
in charting Franklin’s abrupt shift from sup- 
porting the British Empire to embracing the 
American cause, ultimately rupturing beyond 
repair his relationship with his illegitimate 
son William, who had become the royal gov- 
ernor of New Jersey. Franklin never forgave 
William for his loyalism, the most anguished 
of Franklin’s failures in the realm of family. 

As part 2 shows, Franklin was far more suc- 
cessful as an American politician. He played a 
key role in the American declaration of inde- 
pendence, advising Thomas Jefferson’s writ- 
ing of the Declaration. The first American dip- 
lomat, he was American minister to France, 
securing the new nation’s alliance with France 
and then, along with John Jay and John 
Adams, negotiating the Treaty of Paris with 
Great Britain that recognized American inde- 
pendence and doubled the size of the new na- 
tion. And, despite his age and frailty, Franklin 
was a highly effective broker of compromises 
and voice of reasoned moderation during the 
Federal Convention’s struggles to frame the 
US Constitution. 

Like the 2002 Middlemarch documen- 
tary on Franklin (2), Burns’s film is most 
successful in challenging older scholarship’s 
celebration of Franklin as what his classic 
biographer Carl Van Doren called “a harmo-
nious human multitude.” Burns’s Franklin,
by contrast, is evolving and conflicted, a 
man of conventional white supremacist 
views who radically pivots in his percep- 
tions of African Americans and the evils of 
slavery, a man who shows the conflicts and 
inconsistencies of many of his contempo- 
raries on matters of sex and gender, a proud 
Briton who in his sixties and seventies be- 
comes an equally proud American founding 
father. Burns gives us a Benjamin Franklin 
for our own troubled times, an extraordi- 
nary and complicated man who never lost 
sight of what it means to be human. 
War in an automated, data-driven world


An anthropologist misses the mark on a topic of great concern 


By George Lucas 


The topics addressed by Roberto
Gonzalez in his new book, War Virtu- 
ally, are ones that ought to concern 
us all. These include the intersections 
of Big Data and Big Tech and the bat- 
tlefield, lethal autonomous weapons 
systems, cyber conflict, and the enhance- 
ment of such technologies using increas- 
ingly sophisticated artificial intelligence. 

Gonzalez is an anthropologist specializ- 
ing in the Zapotec culture in the northern 
regions of Oaxaca. He has also been a vocal 
critic, along with many other social scien- 
tists, of attempts by the US Army and NATO 
allies to exploit the expertise of social scien- 
tists with knowledge of local cultures dur- 
ing the counterinsurgency and antiterrorist 
campaigns in Iraq and Afghanistan by 
means of a program known as the Human 
Terrain System (HTS). 

Authors of works on military technology 
need not be physical scientists or engineers. 
The political scientist and 21st-century war- 
fare expert Peter Singer, whom Gonzalez 
cites, for example, is neither. And yet Singer 
is remarkably adept at providing a narra-
tive framework that encompasses both the
promises and perils of the efforts of such 
individuals and that is intelligible and in- 
formative for a wide audience. 

Meanwhile, the uses and misuses of so- 
cial science and its practitioners is a con- 
troversial topic, on which the author is an 
unquestioned authority. However, his deci- 
sion to examine what the National Academy 
of Sciences recently termed “exotic” new 
military technologies through an interpre- 
tive framework constructed by examining 
the work of anthropologists Margaret Mead 
and Gregory Bateson during World War II 
and the Phoenix Program—a project in- 
tended to infiltrate and ultimately destroy 
the Viet Cong during the Vietnam War—is 
puzzling. One is left to wonder how such re- 
flections, interwoven with the author’s dis- 
cussions of other historical collaborations 
between social scientists and government 
and military entities, intersect with emerg- 
ing technologies. 

Applying anthropological methodolo- 
gies to contemporary issues such as cyber 
conflict, the social impacts of artificial intel- 
ligence, and military applications of social 
media is a promising, but nascent, area of 
research. However, apart from a discus- 
sion of the Facebook Cambridge Analytica 
scandal—an egregious violation of personal 
privacy, but hardly one that constituted war 


A Legged Squad Support System (LS3) accompanies US Marines on patrol. 
or the “militarization” of data—the author
neglects such topics and gives short shrift 
to the work of fellow anthropologists such 
as Gabriella Coleman, Mizuko Ito, Daniel 
Miller, and Genevieve Bell, who have done 
pioneering work on these subjects. 

There is already a spirited and conten- 
tious discussion concerning the wisdom 
(not to mention legality and morality) of the 
growing international reliance on emerg- 
ing military technologies. Noel Sharkey—a 
world-renowned computer scientist and 
roboticist whom the author unfortunately 
fails to cite—is, like Gonzalez, highly criti- 
cal of the naive reductionism inherent in 
much artificial intelligence research. He, 
too, is deeply skeptical of our ability to 
equip armed autonomous weapons with the 
requisite capacities to operate reliably in 
the battlespace and, together with philoso- 
pher Peter Asaro, cofounded and leads the 
International Committee for Robot Arms 
Control, whose membership has worked 
tirelessly for well over a decade to pressure 
the United Nations and the International 
Red Cross to ban outright any further de- 
velopment or deployment of lethal autono- 
mous military technologies. 

This vitally important debate has also 
been thoroughly and impartially docu- 
mented by tech-savvy journalists such as 
Ellen Nakashima and Noah Shachtman. 
Rather than seeking common cause with 
like-minded experts and organizations,
however, Gonzalez leans into his deep mis- 
trust of all things military and his apparent 
nostalgia for HTS debates. Readers inter- 
ested in the topic of tech-enabled warfare 
would be better served by consulting the 
scholarship of subject-matter experts such 
as Sharkey or Paul Scharre (1, 2). 
THE GAPS


By Laura M. Zahn 


A fully sequenced human genome was triumphantly announced more
than 20 years ago. However, owing to technological limitations, some ge- 
nomic regions remained unresolved. Here, Science presents research by 
the Telomere-to-Telomere (T2T) Consortium, reporting on the endeavor to
complete a comprehensive human reference genome. Generated primarily 
by long-read sequencing of a hydatidiform mole, a doubly haploid growth, 
this effort adds ~200 megabases of genetic information—a full chromo- 
some’s worth—to the human genome. 

Through the resolution of previously unsequenceable and unalignable 
regions, mostly composed of highly repetitive sequences, this reference genome 
allows for a detailed characterization of the centromeric satellite repeats, trans- 
posable elements, and segmental duplications. Mapping of genomic sequences, 
including those from previously published studies, resolves aspects of human ge- 
netic diversity, including evolutionary comparisons with our primate relatives. Fur- 
thermore, it allows for identification of how changes in methylation density differ 
within and among centromeres and how epigenetics can affect the transcription of 
repeat sequences. 

These investigations have only begun to tease apart how the T2T reference ge- 
nome influences the detection of biomedically relevant variants and the evolution of 
genomic regions that determine human traits. Although much remains to be discov- 
ered, the T2T reference genome provides another celebratory benchmark to observe 
as we continue to delve into the genetics that underlie our complete selves. 


Access T2T papers published in Science and companion papers at science.org/collections/completing-human-genome. 
Since its initial release in 2000, the human reference genome has covered only the euchromatic fraction of
the genome, leaving important heterochromatic regions unfinished. Addressing the remaining 8% of the 
genome, the Telomere-to-Telomere (T2T) Consortium presents a complete 3.055 billion-base pair sequence
of a human genome, T2T-CHM13, that includes gapless assemblies for all chromosomes except Y, corrects 
errors in the prior references, and introduces nearly 200 million base pairs of sequence containing 1956 gene 
predictions, 99 of which are predicted to be protein coding. The completed regions include all centromeric 
satellite arrays, recent segmental duplications, and the short arms of all five acrocentric chromosomes, 
unlocking these complex regions of the genome to variational and functional studies. 


The current human reference genome was
released by the Genome Reference Con- 
sortium (GRC) in 2013 and most recently 
patched in 2019 (GRCh38.p13) (1). This 
reference traces its origin to the publicly 


funded Human Genome Project (2) and has 
been continually improved over the past two 
decades. Unlike the competing Celera effort 
(3) and most modern sequencing projects 
based on “shotgun” sequence assembly (4), 


the GRC assembly was constructed from se- 
quenced bacterial artificial chromosomes 
(BACs) that were ordered and oriented along 
the human genome by means of radiation hy- 
brid, genetic linkage, and fingerprint maps. 
However, limitations of BAC cloning led to 
an underrepresentation of repetitive sequences, 
and the opportunistic assembly of BACs de- 
rived from multiple individuals resulted in a 
mosaic of haplotypes. As a result, several GRC 
assembly gaps are unsolvable because of in- 
compatible structural polymorphisms on their 
flanks, and many other repetitive and poly- 
morphic regions were left unfinished or in- 
correctly assembled (5). 

The GRCh38 reference assembly contains 
151 mega-base pairs (Mbp) of unknown se- 
quence distributed throughout the genome, 
including pericentromeric and subtelomeric 
regions, recent segmental duplications, ampli- 
conic gene arrays, and ribosomal DNA (rDNA) 
arrays, all of which are necessary for funda- 
mental cellular processes (Fig. 1A). Some of the 
largest reference gaps include human satellite 
(HSat) repeat arrays and the short arms of all 
five acrocentric chromosomes, which are repre- 
sented in GRCh38 as multimegabase stretches 
of unknown bases (Fig. 1, B and C). In addi- 
tion to these apparent gaps, other regions of 
GRCh38 are artificial or are otherwise incorrect.
For example, the centromeric alpha
satellite arrays are represented as computa- 
tionally generated models of alpha satellite 
monomers to serve as decoys for resequencing 
analyses (6), and sequence assigned to the 
short arm of chromosome 21 appears falsely 
duplicated and poorly assembled (7). When 
compared with other human genomes, GRCh38 
also shows a genome-wide deletion bias that 
is indicative of incomplete assembly (8). De- 
spite finishing efforts from both the Human 
Genome Project (9) and GRC (7) that improved 
the quality of the reference, there was limited
progress toward closing the remaining gaps
in the years that followed (Fig. 1D). 

Long-read shotgun sequencing overcomes 
the limitations of BAC-based assembly and 
bypasses the challenges of structural poly- 
morphism between genomes. PacBio’s multi- 
kilobase, single-molecule reads (10) proved 
capable of resolving complex structural variation
and gaps in GRCh38 (8, 11), whereas
Oxford Nanopore’s >100-kbp “ultralong” reads
(12) enabled complete assemblies of a human
centromere (chromosome Y) (13) and, later, an
entire chromosome (chromosome X) (14). However,
the high error rate (>5%) of these technologies
posed challenges for the assembly
of long, near-identical repeat arrays. PacBio’s 
most recent “HiFi” circular consensus se- 
quencing offers a compromise of 20-kbp read 
lengths with an error rate of 0.1% (15). Where- 
as ultralong reads are useful for spanning re- 
peats, HiFi reads excel at differentiating subtly 
diverged repeat copies or haplotypes (16). 

To finish the last remaining regions of the 
genome, we leveraged the complementary 
aspects of PacBio HiFi and Oxford Nanopore 
ultralong-read sequencing to assemble the 
uniformly homozygous CHM13hTERT cell 
line (hereafter, CHM13) (17). The resulting 
T2T-CHM13 reference assembly removes a 
20-year-old barrier that has hidden 8% of the 
genome from sequence-based analysis, in- 
cluding all centromeric regions and the entire 
short arms of five human chromosomes. Here, 
we describe the construction, validation, and 
initial analysis of a truly complete human ref- 
erence genome and discuss its potential im- 
pact on the field. 


Cell line and sequencing 


As with many prior reference genome im- 
provement efforts (7, 8, 17-20), including the 
T2T assemblies of human chromosomes X (14) 
and 8 (21), we targeted a complete hydatidiform 
mole (CHM) for sequencing. Most CHM ge- 
nomes arise from the loss of the maternal 
complement and duplication of the paternal 
complement postfertilization and are, there- 
fore, homozygous with a 46,XX karyotype 
(22). Sequencing of CHM13 confirmed near- 
ly uniform homozygosity, with the excep- 
tion of a few thousand heterozygous variants 
and a megabase-scale heterozygous deletion 
within the rDNA array on chromosome 15 
(23) (figs. S1 and S2). Local ancestry analy- 
sis shows that most of the CHM13 genome 
is of European origin, including regions of 
Neanderthal introgression, with some pre- 
dicted admixture (23) (Fig. 1A). Compared 
with diverse samples from the 1000 Genomes 
Project (1KGP) (24), CHM13 possesses no apparent
excess of singleton alleles or loss-of-function
variants (25).

We extensively sequenced CHM13 with multiple
technologies (23), including 30x PacBio

Table 1. Comparison of GRCh38 and T2T-CHM13v1.1 human genome assemblies. GRCh38 summary
statistics exclude “alts” (110 Mbp), patches (63 Mbp), and chromosome Y (58 Mbp). Assembled bases include all
non-N bases. Unplaced bases are those not assigned or positioned within a chromosome. GRCh38 scaffolds
were split at three consecutive Ns to obtain the number of contigs. Contig NG50 is the largest value such that
contigs of at least this size total more than half of the 3.05-Gbp genome size. The number of exclusive genes or
transcripts is as follows: for GRCh38, GENCODE genes and transcripts not found in CHM13; and for CHM13,
extra putative paralogs that are not in GENCODE. Segmental duplication analysis is from (42). RepeatMasker
analysis is from (49). Blank spaces indicate not applicable.

STATISTICS                                    GRCH38      T2T-CHM13   DIFFERENCE (±%)

Summary
  Assembled bases (Gbp)                       2.92        3.05        +4.5
  Unplaced bases (Mbp)                        11.42       0           -100.0
  Gap bases (Mbp)                             120.31      0           -100.0
  Number of contigs                           949         24          -97.5
  Contig NG50 (Mbp)                           56.41       154.26      +173.5
  Number of issues                            230         46          -80.0
  Issues (Mbp)                                230.43      8.18        -96.5

Gene annotation
  Number of genes                             60,090      63,494      +5.7
    Protein coding                            19,890      19,969      +0.4
  Number of exclusive genes                   263         3,604
    Protein coding                            63          140
  Number of transcripts                       228,597     233,615     +2.2
    Protein coding                            84,277      86,245      +2.3
  Number of exclusive transcripts             1,708       6,693
    Protein coding                            829         2,780

Segmental duplications
  Percentage of segmental duplications (%)    5.00        6.61
  Segmental duplication bases (Mbp)           151.71      201.93      +33.1
  Number of segmental duplications            24,097      41,528      +72.3

RepeatMasker
  Percentage of repeats (%)                   51.89       53.94
  Repeat bases (Mbp)                          1,516.37    1,647.81    +8.7
    Long interspersed nuclear elements        626.33      631.64      +0.8
    Short interspersed nuclear elements       386.48      390.27      +1.0
    Long terminal repeats                     267.52      269.91      +0.9
    Satellite                                 76.51       150.42      +96.6
    DNA                                       108.53      109.35      +0.8
    Simple repeat                             36.5        77.69       +112.9
    Low complexity                            6.16        6.44        +4.6
    Retroposon                                4.51        4.65        +3.3
    rRNA                                      0.21        1.71        +730.4
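
The NG50 definition given in the Table 1 legend can be written out as a short calculation. The following is a minimal sketch, assuming a plain list of contig lengths; the function name `ng50` and the toy lengths are illustrative and are not taken from the T2T-CHM13 data.

```python
def ng50(contig_lengths, genome_size):
    """Largest length L such that contigs of length >= L together
    cover more than half of genome_size (0 if that is never reached)."""
    half = genome_size / 2
    running_total = 0
    # Walk the contigs from longest to shortest, accumulating their lengths.
    for length in sorted(contig_lengths, reverse=True):
        running_total += length
        if running_total > half:
            return length
    return 0


# Hypothetical contig lengths and genome size, for illustration only.
toy_contigs = [50, 40, 30, 20, 10]
print(ng50(toy_contigs, genome_size=160))  # 50 + 40 = 90 > 80, so NG50 is 40
```

Unlike N50, which is normalized by the total assembly length, NG50 is normalized by a fixed genome size, which is why the legend specifies 3.05 Gbp for both assemblies.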


circular consensus sequencing (HiFi) (16, 20),
120x Oxford Nanopore ultralong-read sequencing
(ONT) (14, 21), 100x Illumina PCR-Free
sequencing (ILMN) (1), 70x Illumina
Arima Genomics Hi-C (Hi-C) (14), BioNano
optical maps (14), and single-cell DNA template
strand sequencing (Strand-seq) (20) (table S1).
To enable assembly of the highly repetitive 
centromeric satellite arrays and closely related 
segmental duplications, we developed methods
for assembly, polishing, and validation that
better utilize these available datasets. 


Genome assembly 


The basis of the T2T-CHM13 assembly is a 
high-resolution assembly string graph (26) 
built directly from HiFi reads. In a bidirected 
string graph, nodes represent unambiguously 
assembled sequences, and edges correspond 
to the overlaps between them, owing to either 
repeats or true adjacencies in the underlying 
genome. The CHM13 graph was constructed 
using a purpose-built method that combines 
components from existing assemblers (16, 27) 
along with specialized graph processing 
(23). Most HiFi errors are small insertions 
or deletions within homopolymer runs and 
simple sequence repeats (16), so homopolymer 
runs were first “compressed” to a single
nucleotide (e.g., A_1...A_n becomes A_1 for n >
1). All compressed reads were then aligned
to one another to identify and correct small 
errors, and differences within simple se- 
quence repeats were masked. After com- 
pression, correction, and masking, only exact 
read overlaps were considered during graph 
construction, followed by iterative graph sim- 
plification (23). 
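
The homopolymer compression step described above can be illustrated in a few lines. This is a minimal sketch of the idea only, not the purpose-built method used for CHM13; the function name `compress_homopolymers` and the example reads are assumptions made for illustration.

```python
from itertools import groupby


def compress_homopolymers(seq):
    """Collapse every run of identical bases to a single base."""
    # groupby yields one (base, run) pair per maximal run of equal characters.
    return "".join(base for base, _run in groupby(seq))


# Two hypothetical reads that differ only in homopolymer run lengths.
read_a = "GGGATTTTACCA"
read_b = "GGATTTACCCA"
print(compress_homopolymers(read_a))  # GATACA
print(compress_homopolymers(read_a) == compress_homopolymers(read_b))  # True
```

Because the two reads become identical after compression, an insertion or deletion inside a homopolymer run no longer prevents an exact overlap between them.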

In the resulting graph, most components 
originate from a single chromosome and have 
an almost linear structure (Fig. 2A), which 
suggests that few perfect repeats greater than 
roughly 10 kbp exist between different chro- 
mosomes or distant loci. Two notable excep- 
tions are the five acrocentric chromosomes, 
which form a single connected component in 
the graph, and a recent multimegabase HSat3 
duplication on chromosome 9, consistent with 
the 9qh+ karyotype of CHM13 (fig. S3). Minor 
fragmentation of the chromosomes into multi- 
ple components resulted from a lack of HiFi 
sequencing coverage across GA-rich sequences 
(6). These gaps were later filled with a prior 
ONT-based assembly (CHM13v0.7) (14).

Ideally, the complete sequence for each chro- 
mosome should exist as a walk through the 
string graph where some nodes may be tra- 
versed multiple times (repeats) and some not 
at all (errors and heterozygous variants). To 
help identify the correct walks, we estimated 
coverage depth and multiplicity of the nodes 
(23), which allowed most tangles to be man- 
ually resolved as unique walks visiting each 
node the appropriate number of times (Fig. 2B 
and fig. S4). In the remaining cases, the correct 
path was ambiguous and required integration 
of ONT reads (Fig. 2, C and D). Where possible, 
ONT reads were aligned to candidate traver- 
sals or directly to the HiFi graph (28) to guide 
the correct walk (fig. S5), but more elaborate
strategies were required for recent satellite 
array duplications on chromosomes 6 and 9 
(23). Only the five rDNA arrays, constituting 
about 10 Mbp of sequence, could not be resolved
with the string graph and required a
specialized approach (described later). An 
accurate consensus sequence for the selected 
graph walks was computed from the un- 
compressed HiFi reads (23), resulting in the 
CHM13v0.9 draft assembly. 
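
The idea of node multiplicity can be illustrated with a simple depth ratio. This is a simplified sketch, assuming that multiplicity is approximated by dividing a node's read depth by the depth of single-copy sequence and rounding; the actual estimation procedure is described in (23), and the node names and depths below are hypothetical.

```python
def estimate_multiplicity(node_coverage, single_copy_coverage):
    """Approximate how many times each graph node should be traversed."""
    return {
        node: round(depth / single_copy_coverage)
        for node, depth in node_coverage.items()
    }


# Hypothetical node depths; 35 is an assumed single-copy depth.
coverage = {"n1": 34.1, "n2": 70.3, "n3": 3.2}
print(estimate_multiplicity(coverage, single_copy_coverage=35.0))
# {'n1': 1, 'n2': 2, 'n3': 0}
```

In this toy case, `n2` would be visited twice by a walk (a two-copy repeat), whereas `n3` would be skipped as a likely error or heterozygous variant.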

For comparative genomics of the centro- 
mere (29, 30), we repeated this process on an 
additional X chromosome from the Coriell 
GM24385 cell line [National Institute of Stan- 
dards and Technology (NIST) ID: HG002]. The 
resulting T2T-HG002-ChrX assembly shows 
comparable accuracy to T2T-CHM13 (23) (figs. 
S6 to S8). 


rDNA assembly 


The most complex region of the CHM13 string 
graph involves the human rDNA arrays and 
their surrounding sequence (Fig. 2D). Human 
rDNAs are 45-kbp near-identical repeats that 
encode the 45S rRNA and are arranged in 
large, tandem repeat arrays embedded within 
the short arms of the acrocentric chromosomes. 
The length of these arrays varies between 
individuals (31) and even somatically, especially
with aging and certain cancers (32). A
typical diploid human genome has an average
of 315 rDNA copies, with a standard deviation
of 104 copies (31). We estimate that the diploid
CHM13 genome contains about 400 rDNA
copies based on ILMN depth of coverage (23)
(fig. S9) or 409 ± 9 (mean ± SD) rDNA copies
by droplet digital polymerase chain reaction 
(ddPCR) (fig. S10). 

To assemble these highly dynamic regions 
of the genome and overcome limitations of 
the string graph construction (23) (fig. S11), 
we constructed sparse de Bruijn graphs for 
each of the five rDNA arrays (33) (fig. S12). 
ONT reads were aligned to the graphs to iden- 
tify a set of walks, which were converted to 
sequence, segmented into individual rDNA 
units, and clustered into “morphs” according 
to their sequence similarity. The copy number 
of each morph was estimated from the num- 
ber of supporting ONT reads, and consensus 
sequences were polished with mapped HiFi 
reads. ONT reads spanning two or more 
rDNA units were used to build a morph graph 
representing the structure of each array 
(fig. S12). 
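
Grouping rDNA units into morphs by sequence similarity can be sketched with a greedy clustering. This is an illustration under stated assumptions, not the method used for CHM13: similarity is approximated with Python's `difflib.SequenceMatcher` rather than a sequence aligner, the 0.9 threshold is arbitrary, and the short toy units stand in for 45-kbp rDNA units. The member count of each cluster plays the role of read support.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Rough similarity in [0, 1]; a stand-in for alignment identity."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def cluster_into_morphs(units, min_similarity=0.9):
    """Greedily assign each unit to the first morph it resembles."""
    morphs = []  # list of (representative, members) pairs
    for unit in units:
        for representative, members in morphs:
            if similarity(unit, representative) >= min_similarity:
                members.append(unit)
                break
        else:
            # No existing morph is similar enough, so start a new one.
            morphs.append((unit, [unit]))
    return morphs


# Hypothetical toy units: three near-identical copies and one divergent copy.
units = [
    "ACGTACGTACGTACGTACGT",
    "ACGTACGTACGTACGTACGT",
    "ACGTACGTACTTACGTACGT",  # one substitution relative to the first unit
    "TTTTTGGGGGCCCCCAAAAA",
]
morphs = cluster_into_morphs(units)
print([len(members) for _rep, members in morphs])  # [3, 1]
```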

The shorter arrays on chromosomes 14 and 
22 consist of a single primary morph arranged 
in a head-to-tail array, whereas the longer 
arrays on chromosomes 13, 15, and 21 exhibit 
a more mosaic structure involving multiple, 
interspersed morphs. In these cases, the ONT 
reads were not long enough to fully resolve the 
ordering, and the primary morphs were arti- 
ficially arranged in consecutive blocks reflect- 
ing their estimated copy number. These three 
arrays capture the chromosome-specific morphs 
but should be treated as model sequences. 
The final T2T-CHM13 assembly contains 219
complete rDNA copies, totaling 9.9 Mbp of
sequence.
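
The block arrangement used for the three model arrays, and the reported array total, can be checked with a short sketch. The morph names and copy numbers below are hypothetical; only the 219 copies and the 45-kbp unit length come from the text.

```python
def arrange_in_blocks(morph_copy_numbers):
    """Lay out morphs as consecutive blocks, one block per morph."""
    array = []
    for morph, copies in morph_copy_numbers:
        array.extend([morph] * copies)
    return array


# Hypothetical morphs and copy numbers, for illustration only.
print(arrange_in_blocks([("morph_a", 3), ("morph_b", 2)]))
# ['morph_a', 'morph_a', 'morph_a', 'morph_b', 'morph_b']

# Consistency check: 219 copies of a roughly 45-kbp unit, in Mbp.
print(219 * 45 / 1000)  # 9.855
```

The product of about 9.9 Mbp agrees with the stated total, which suggests that the assembled units are close to the nominal 45-kbp length.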


Assembly validation and polishing 


To evaluate concordance between the reads 
and the assembly, we mapped all available 
primary data—including HiFi, ONT, ILMN, 
Strand-seq, and Hi-C—to the CHM13v0.9 draft
assembly to identify both small and structural 
variants [see (34) for a complete description]. 
Manual curation corrected 4 large and 993 
small errors, resulting in the CHM13v1.0 as- 
sembly, and identified 44 large and 3901 small 
heterozygous variants (34). Further telomere 
polishing and addition of the rDNA arrays (23) 
resulted in a complete, telomere-to-telomere 
assembly of a human genome, T2T-CHM13v1.1. 

The T2T-CHM13 assembly is consistent 
with previously validated assemblies of chromosomes
X (14) and 8 (21), and the sizes of
assembled satellite arrays match ddPCR 
copy-number estimates for those tested (fig. 
S10 and tables S2 and S3). Mapped Strand-seq 
(figs. S13 and S14) and Hi-C (fig. S15) data 
show no signs of misorientations or other 
large-scale structural errors. The assembly 
correctly resolves 644 of 647 previously se- 
quenced CHM13 BACs at >99.99% identity, 
with the three others reflecting errors in the 
BACs themselves (figs. S16 to S19). 

Mapped sequencing read depth shows uni- 
form coverage across all chromosomes (Fig. 3A), 
with 99.86% of the assembly within three stan- 
dard deviations of the mean coverage for either 
HiFi or ONT (HiFi coverage 34.70 ± 7.03 and
ONT coverage 116.16 ± 16.96, excluding the
mitochondrial genome). Ignoring the 10 Mbp 
of rDNA sequence, where most of the coverage 
deviation resides, 99.99% of the assembly is 
within three standard deviations (23). Align- 
ment-free analysis of ILMN and HiFi copy- 
number data also shows concordance with 
the assembly (figs. S20 and S21). This is con- 
sistent with uniform coverage of the genome 
and confirms both the accuracy of the assem- 
bly and the absence of aneuploidy in the se- 
quenced CHM13 cells. 
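
The coverage-uniformity statistic can be illustrated as the fraction of positions whose depth lies within three standard deviations of the mean. This is a minimal sketch on hypothetical per-window depths; the windowing and alignment filtering actually used are described in (23), and the function name is an assumption.

```python
from statistics import mean, pstdev


def fraction_within_k_sd(depths, k=3.0):
    """Fraction of depth values within k standard deviations of the mean."""
    mu = mean(depths)
    sd = pstdev(depths)
    inside = sum(1 for d in depths if abs(d - mu) <= k * sd)
    return inside / len(depths)


# Hypothetical depths: 99 windows near 35x and one strong outlier.
depths = [35.0] * 99 + [350.0]
print(fraction_within_k_sd(depths))  # 0.99
```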

Coverage increases or decreases were ob- 
served across multiple satellite arrays (Fig. 3, 
B to D). However, given the uniformity of cov- 
erage across these arrays, association with spe- 
cific satellite classes, and the sometimes opposite 
effect observed for HiFi and ONT, we hypoth- 
esize that these anomalies are related to biases 
introduced during sample preparation, sequenc- 
ing, or base calling, rather than assembly error 
(23) (figs. S22 to S26 and table S4). Although 
the specific mechanisms require further inves- 
tigation, prior studies have noted similar biases 
within certain satellite arrays and sequence 
contexts for both ONT and HiFi (35, 36). 

Because they are the most difficult regions 
of the genome to assemble, we performed targeted
validation of long tandem repeats to
identify any errors missed by the genome-wide
approach. The assembled rDNA morphs, being 
only 45 kbp each, were manually validated by 
inspection of the read alignments used for 
polishing. Alpha satellite higher-order repeats 
(HORs) were validated using a purpose-built 
method (37) (fig. S27 and table S5) and com- 
pared with independent ILMN-based HOR 
copy-number estimates (fig. S28). All centromeric
satellite arrays, including beta satellite
(BSat) and HSat repeats, were further validated
by measuring the ratio of primary to
secondary variants identified by HiFi reads 
(38) (fig. S29). 

The consensus accuracy of the T2T-CHM13 
assembly is estimated to be about one error 
per 10 Mbp (23, 34), which exceeds the his- 
torical standard of “finished” sequence by 
orders of magnitude. However, regions of low 
HiFi coverage were found to be associated
with an enrichment of potential errors, as estimated
from both HiFi and ILMN data (34).
To guide future use of the assembly, we have 
cataloged all low-coverage, low-confidence, 
and known heterozygous sites identified by 
the above validation procedures (34). The total 
number of bases covered by potential issues in 
the T2T-CHM13 assembly is just 0.3% of the 
total assembly length compared with 8% for 
GRCh38 (Fig. 3A). 
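
The consensus accuracy estimate can be expressed on the Phred quality scale, Q = -10 log10(error rate). The sketch below converts the stated rate of one error per 10 Mbp; the comparison value of one error per 10 kbp for "finished" sequence is the commonly cited historical threshold and is an assumption here, as it is not stated in the text.

```python
import math


def phred_quality(errors, bases):
    """Phred-scaled quality for a given number of errors per bases."""
    return -10 * math.log10(errors / bases)


print(round(phred_quality(1, 10_000_000)))  # one error per 10 Mbp -> 70
print(round(phred_quality(1, 10_000)))      # assumed finished standard -> 40
```

A difference of 30 on this scale corresponds to a 1000-fold lower error rate, in line with the statement that the assembly exceeds the historical standard by orders of magnitude.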


Fig. 2. High-resolution assembly string graph of the CHM13 genome. 

(A) Bandage (60) visualization, where nodes represent unambiguously 
assembled sequences scaled by length and edges correspond to the overlaps 
between node sequences. Each chromosome is both colored and numbered on the 
short (p) arm. Long (q) arms are labeled where unclear. The five acrocentric 
chromosomes (bottom right) are connected owing to similarity between their short 
arms, and the rDNA arrays form five dense tangles because of their high copy 
number. The graph is partially fragmented because of HiFi coverage dropout 
surrounding GA-rich sequence (black triangles). Centromeric satellites (30) are 
the source of most ambiguity in the graph (gray highlights). MT, mitochondria.
(B) The ONT-assisted graph traversal for the 2p11 locus is given by numerical order.
Based on low depth of coverage, the unlabeled light gray node represents an
artifact or heterozygous variant and was not used. (C) The multimegabase tandem
HSat3 duplication (9qh+) at 9q12 requires two traversals of the large loop structure.
(The size of the loop is exaggerated because graph edges are of constant size.)
Nodes used by the first traversal are in dark purple, and nodes used by the second
traversal are in light purple. Nodes used by both traversals typically have twice
the sequencing coverage. (D) Enlargement of the distal short arms of the
acrocentrics, showing the colored graph walks and edges between highly similar 
sequences in the distal junctions (DJs) adjacent to the rDNA arrays. 


science.org SCIENCE 


RESEARCH | COMPLETING THE HUMAN GENOME 


A truly complete genome 

T2T-CHM13 includes gapless telomere-to- 
telomere assemblies for all 22 human au- 
tosomes and chromosome X, comprising 
3,054,815,472 bp of nuclear DNA, plus a 
16,569-bp mitochondrial genome. This com- 
plete assembly adds or corrects 238 Mbp of 
sequence that does not colinearly align to 
GRCh38 over a 1-Mbp interval (i.e., is non- 
syntenic), primarily comprising centromeric 
satellites (76%), nonsatellite segmental dupli- 
cations (19%), and rDNAs (4%) (Fig. 1C). Of 
this, 182 Mbp of sequence has no primary 
alignments to GRCh38 and is exclusive to 
T2T-CHM13. As a result, T2T-CHM13 increases 
the number of known genes and repeats in the 
human genome (Table 1). 

To provide an initial annotation, we used 
both the Comparative Annotation Toolkit 
(CAT) (39) and Liftoff (40) to project the 
GENCODE v35 (41) reference annotation 
onto the T2T-CHM13 assembly. Addition- 
ally, CHM13 full-length isoform sequencing 
(Iso-seq) transcriptome reads were assembled 
into transcripts and provided as complementary
input to CAT. A comprehensive annotation
was built by combining the CAT annotation 
with genes identified only by Liftoff (23). 

The draft T2T-CHM13 annotation totals 
63,494 genes and 233,615 transcripts, of which 
19,969 genes (86,245 transcripts) are pre- 
dicted to be protein coding, with 683 predicted 
frameshifts in 385 genes (469 transcripts) 
(Table 1, fig. S30, and tables S6 to S8). Only 263 
GENCODE genes (448 transcripts) are exclu- 
sive to GRCh38 and have no assigned ortholog 
in the CHM13 annotation (tables S9 and S10). 
Of these, 194 are due to a lower copy number 
in the CHM13 annotation (fig. S31), 46 do not 
align well to CHM13, and 23 correspond to 
known false duplications in GRCh38 (25) 
(fig. S32). Most of these genes are noncoding 
and associated with repetitive elements. Only 
four are annotated as being medically relevant 
(CFHR1, CFHR3, OR51A2, UGT2B28), all of
which are absent owing to a copy number dif- 
ference, and the only protein-coding genes 
that align poorly are immunoglobulin and T cell 
receptor genes, which are known to be highly 
diverse.

In comparison, a total of 3604 genes (6693
transcripts) are exclusive to CHM13 (tables S11 
and S12). Most of these genes represent puta- 
tive paralogs and localize to pericentromeric 
regions and the short arms of the acrocentrics, 
including 876 rRNA transcripts. Only 48 of the 
CHM13-exclusive genes (56 transcripts) were 
predicted solely from de novo assembled transcripts.
Of all genes exclusive to CHM13, 140 are
predicted to be protein coding based on their 
GENCODE paralogs and have a mean of 99.5% 
nucleotide and 98.7% amino acid identity to 
their most similar GRCh38 copy (table S13). Al- 
though some of these additional paralogs may 
be present (but unannotated) in GRCh38 (23), 
1956 of the genes exclusive to CHM13 (99 pro- 
tein coding) are in regions with no primary 
alignment to GRCh38 (table S11). A broader 
set of 182 multi-exon protein-coding genes 
fall within nonsyntenic regions, 36% of which 
were confirmed to be expressed in CHM13 (42). 

Fig. 3. Sequencing coverage and assembly validation. (A) Uniform whole-
genome coverage of mapped HiFi and ONT reads is shown with primary
alignments in light shades and marker-assisted alignments overlaid in dark
shades. Large HSat arrays (30) are noted by triangles, with inset regions marked
by arrowheads and the location of the rDNA arrays marked with asterisks.
Regions with low unique marker frequency (light green) correspond to drops
in unique marker density but are recovered by the lower-confidence primary
alignments. Annotated assembly issues are compared for T2T-CHM13 and
GRCh38. Hets, heterozygous variants; k, marker size. (B to D) Enlargements
corresponding to regions of the genome featured in Fig. 2, B to D, respectively.
Uniform coverage changes within certain satellites are reproducible and likely
caused by sequencing bias. Identified heterozygous variants and assembly issues
are marked below and typically correspond with low coverage of the primary
allele (black) and increased coverage of the secondary allele (red). The
percentage of microsatellite repeats for every 128-bp window is shown at the
bottom. dHOR, divergent HOR; mon, monomeric.

Compared with GRCh38, T2T-CHM13 is a
more complete, accurate, and representative
reference for both short- and long-read variant
calling across human samples of all ancestries
(25). Reanalysis of 3202 short-read datasets
from the 1KGP showed that T2T-CHM13 
simultaneously reduces both false-negative 
and false-positive variant calls because of the 
addition of 182 Mbp of missing sequence and 
the exclusion of 1.2 Mbp of falsely duplicated 
sequence in GRCh38. These improvements, 
combined with a lower frequency of rare 
variants and errors in T2T-CHM13, eliminate 
tens of thousands of spurious variants per 
1KGP sample (25). In addition, the T2T-CHM13 
reference was found to be more representative 
of human copy-number variation than GRCh38 
when compared against 268 human genomes 
from the Simons Genome Diversity Project 
(SGDP) (42, 43). Specifically, within nonsyntenic
segmentally duplicated regions of
the genome, T2T-CHM13 is nine times more
predictive of SGDP copy number than GRCh38 
(42). These results underscore both the quality 
of the assembly and the genomic stability of 
the cell line from which it was derived. 


Acrocentric chromosomes 


T2T-CHM13 uncovers the genomic structure 
of the short arms of the five acrocentric chro- 
mosomes, which, despite their importance 
for cellular function (44), have remained 
largely unsequenced to date. This omission 
has been due to their enrichment for satellite 
repeats and segmental duplications, which 
has prohibited sequence assembly and lim- 
ited their characterization to cytogenetics, 
restriction mapping, and BAC sequencing
(45-47). All five of CHM13’s short arms follow
a similar structure consisting of an rDNA
array embedded within distal and proximal
repeat arrays (Fig. 4). From telomere to centromere,
the short arms vary in size from 10.1 Mbp
(chromosome 14) to 16.7 Mbp (chromosome 15),
with a combined length of 66.1 Mbp.

Compared with other human chromosomes,
the short arms of the acrocentrics are unusually
similar to one another. Specifically, we
find that 5-kbp windows align with a median
identity of 98.7% between the short arms,
creating many opportunities for interchromosomal
exchange (Fig. 4). This high degree
of similarity is presumably due to recent nonallelic
or ectopic recombination stemming
from their colocalization in the nucleolus
(46).

Fig. 4. Short arms of the acrocentric chromosomes. Each short arm is
shown along with annotated genes, percentage of methylated CpGs (29), and a
color-coded satellite repeat annotation (30). The rDNA arrays are represented by
a directional arrow and copy number because of their high self-similarity, which
prohibits ONT mapping. Percent identity heatmaps versus the other four arms
were computed in 10-kbp windows and smoothed over 100-kbp intervals. Each
position shows the maximum identity of that window to any window in the other
chromosome. The distal short arms include conserved satellite structure and
inverted repeats (thin arrows), whereas the proximal short arms show a diversity
of structures. The proximal short arms of chromosomes 13, 14, and 21 share a
segmentally duplicated core, including small alpha satellite HOR arrays and a central,
highly methylated SST1 array (thin arrows with teal block). Yellow triangles
indicate hypomethylated centromeric dip regions (CDRs), marking the sites of
kinetochore assembly (29). Numbers in parentheses indicate rDNA copy number.
ACRO, acrocentric repeat; CER, centromeric repeat; DJ, distal junction; PJ, proximal
junction; SD, segmental duplication.

Fig. 5. Resolved FRG1 paralogs. (A) Protein-coding gene FRG1 and its 23 paralogs
in CHM13. Only nine are found in GRCh38. Genes are drawn larger than their
actual size, and the “FRG” prefix is omitted for brevity. All paralogs are found near
satellite arrays. Most copies exhibit evidence of expression, including CpG islands
present at the 5’ start site with varying degrees of methylation. (B) Reference
(gray) and variant (colored) allele coverage is shown for four human HiFi samples
mapped to the paralog FRG1DP. When mapped to GRCh38, the region shows
excessive HiFi coverage and variants, indicating that reads from the missing
paralogs are mismapped to FRG1DP (variants >80% frequency shown).
When mapped to CHM13, HiFi reads show the expected coverage and a typical
heterozygous variation pattern for the three non-CHM13 samples (variants
>20% frequency shown). These nonreference alleles are also found in other
populations from 1KGP ILMN data. NonRef AF, nonreference allele frequency; AFR,
African; AMR, admixed American; EAS, East Asian; EUR, European; SAS, South
Asian. (C) Mapped HiFi read coverage for other FRG1 paralogs, with an extended
context shown for chromosome 20. Coverage of HiFi reads that mapped
to FRG1DP in GRCh38 is highlighted (dark gray), showing the paralogous copies
they originate from (FRG1BP4 to FRG1BP10, FRG1GP, FRG1GP2, and FRG1KP4).
Background coverage is variable for some paralogs, suggesting the presence
of copy-number polymorphism in the population. (D) Methylation and expression
profiles suggest transcription of FRG1DP in CHM13. In the copy-number display
(bottom), 100-bp windows from the CHM13 assembly are highlighted with a color
representing the estimated copy number of that sequence in an SGDP sample. The
CHM13 and GRCh38 tracks show the estimated copy number of these same
sequences in the respective assemblies. CHM13 copy number resembles all samples
from the SGDP, whereas GRCh38 underrepresents the true copy number.


1 APRIL 2022 • VOL 376 ISSUE 6588 51


Additionally, considering an 80% identity
threshold, no 5-kbp window on the short
arms is unique, and 96% of the non-rDNA se- 
quence can be found elsewhere in the genome, 
suggesting that the acrocentrics are dynamic 
sources of segmental duplication. 
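
The window-based comparison of the short arms reduces to two summary statistics once each window has been assigned its best identity to another locus: the median identity, and the fraction of windows that fall below the uniqueness threshold. The sketch below assumes those per-window identities have already been computed by an aligner; the values and the function name are hypothetical.

```python
from statistics import median


def summarize_windows(best_identity, unique_below=0.80):
    """Return (median identity, fraction of windows counted as unique)."""
    unique = [x for x in best_identity if x < unique_below]
    return median(best_identity), len(unique) / len(best_identity)


# Hypothetical best-hit identities for five 5-kbp windows.
identities = [0.99, 0.987, 0.95, 0.82, 0.999]
print(summarize_windows(identities))  # (0.987, 0.0)
```

A unique fraction of zero at the 80% threshold corresponds to the observation that no 5-kbp window on the short arms is unique.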

The rDNA arrays of CHM13 vary in size 
from 0.7 Mbp (chromosome 14) to 3.6 Mbp 
(chromosome 13) and are in the expected 
arrangement, organized as head-to-tail tan- 
dem arrays with all 45S transcriptional units 
pointing toward the centromere. No inver- 
sions were noted within the arrays, and near- 
ly all rDNA units are full length, in contrast 
to some prior studies that reported em- 
bedded inversions and other noncanonical 
structures (47, 48). Each array appears high- 
ly homogenized, and there is more variation 
between rDNA units on different chromosomes
than within chromosomes (fig. S33),
suggesting that intrachromosomal exchange 
of rDNA units through nonallelic homologous 
recombination is more common than inter- 
chromosomal exchange. 

Many 45S gene copies on the same chromo- 
some are identical to one another, whereas 
the identity of the most frequent 45S morphs 
between chromosomes ranges from 99.4 to 
99.7%. A chromosome 15 rDNA morph shows 
the highest identity (98.9%) to the current 
KY962518.1 rDNA reference sequence, orig- 
inally derived from a human chromosome 21 
BAC clone (47). As expected, the 13-kbp 45S 
is more conserved than the intergenic spacer, 
with all major 45S morphs aligning between 
99.4 and 99.6% identity to KY962518.1. Certain 
rDNA variants appear to be chromosome 
specific, including single-nucleotide variants 
within the 45S and its upstream promoter 
region (fig. S34). The most evident variants 
are repeat expansions and contractions within 
the tandem “R” repeat that immediately fol- 
lows the 45S and the CT-rich “long” repeat 
located in the middle of the intergenic spacer. 
The most frequent morph in each array can be 
specifically distinguished by these two fea- 
tures (fig. S35). 

From the telomere to the rDNA array, the 
structure of all five distal short arms follows 
a similar pattern that involves a symmetric 
arrangement of inverted segmental dupli- 
cations and acrocentric, HSat3, BSat, and 
HSat1 repeats (Fig. 4); however, the sizes of
these repeat arrays vary among chromosomes. 
Chromosome 13 is missing the distal half of 
the inverted duplication and has an expanded 
HSat1 array relative to the others. Despite 
their variability in size, all satellite arrays share 
a high degree of similarity (typically >90% 
identity) both within and between acrocen- 
tric chromosomes. Chromosomes 14 and 22 
also feature the expansion of a 64-bp Alu- 
associated satellite repeat (“Walu”) within 
the distal inverted duplication (49), the location
of which was confirmed by fluorescence
in situ hybridization (FISH) (fig. S36). The 
distal junction immediately before the rDNA 
array includes centromeric repeats and a 
highly conserved and actively transcribed 
200-kbp palindromic repeat, which agrees 
with previous characterizations of the rDNA 
flanking sequences (46, 50). 

Extending from the rDNA array to the cen- 
tromere, the proximal short arms are larger in 
size and show a higher diversity of structures, 
including shuffled segmental duplications 
(42), composite transposable element arrays 
(49), satellite arrays (including HSat3, BSat, 
HSat1, and HSat5), and alpha satellite arrays
(both monomeric and HORs) (30). Some prox- 
imal BSat arrays show a mosaic inversion 
structure that was also observed in HSat arrays 
elsewhere in the genome (30) (fig. S37). The 
proximal short arms of chromosomes 13, 14, 
and 21 appear to share the highest degree of 
similarity with a large region of segmental 
duplication, including similar HOR subsets 
and a central and highly methylated SST1 
array (Fig. 4). This coincides with these three 
chromosomes being most frequently involved 
in Robertsonian translocations (51). Alpha satellite
HORs on chromosomes 13 and 21 and
chromosomes 14 and 22 also share high sim- 
ilarity within each pair, but not between them 
(52, 53). Nonsatellite sequences within these 
segmental duplications often exceed 99% 
identity and show evidence of transcription 
(29, 42, 49). Using the T2T-CHM13 reference 
as a basis, further study of additional genomes 
is now needed to understand which of these 
features are conserved across the human 
population. 


Analyses and resources 


A number of companion studies were carried 
out to characterize the complete sequence of 
a human genome, including comprehensive 
analyses of centromeric satellites (30), seg- 
mental duplications (42), transcriptional (49) 
and epigenetic profiles (29), mobile elements 
(49), and variant calls (25). Up to 99% of the 
complete CHM13 genome can be confidently 
mapped with long-read sequencing, opening 
these regions of the genome to functional and 
variational analysis (23) (fig. S38 and table 
S14). We have produced a rich collection of
annotations and omics datasets for CHM13— 
including RNA sequencing (RNA-seq) (30), 
Iso-seq (21), precision run-on sequencing 
(PRO-seq) (49), cleavage under targets and 
release using nuclease (CUT&RUN) (30), and 
ONT methylation (29) experiments—and have 
made these datasets available via a centralized 
University of California, Santa Cruz (UCSC), 
Assembly Hub genome browser (54). 

To highlight the utility of these genetic and 
epigenetic resources mapped to a complete 
human genome, we provide the example of
a segmentally duplicated region of the chromosome
4q subtelomere that is associated
with facioscapulohumeral muscular dystrophy 
(FSHD) (55). This region includes FSHD region
gene 1 (FRG1), FSHD region gene 2 (FRG2),
and an intervening D4Z4 macrosatellite repeat
containing the double homeobox 4 (DUX4) 
gene that has been implicated in the etiology 
of FSHD (56). Numerous duplications of this 
region throughout the genome have compli- 
cated past genetic analyses of FSHD. 

The T2T-CHM13 assembly reveals 23 paralogs
of FRG1 spread across all acrocentric chromosomes
as well as chromosomes 9 and 20
(Fig. 5A). This gene appears to have undergone
recent amplification in the great apes (57), and
approximate locations of FRG1 paralogs were
previously identified by FISH (58). However,
only nine FRG1 paralogs are found in GRCh38,
hampering sequence-based analysis.

One of the few FRG1 paralogs included in
GRCh38, FRG1DP, is located in the centromeric
region of chromosome 20 and shares
high identity (97%) with several paralogs
(FRG1BP4 to FRG1BP10) (23) (fig. S39 and
tables S15 and S16). When mapping HiFi
reads, the absence of the additional FRG1
paralogs in GRCh38 causes their reads to
incorrectly align to FRG1DP, resulting in many
false-positive variants (Fig. 5B). Most FRG1
paralogs appear present in other human genomes
(Fig. 5C), and all except FRG1KP2 and
FRG1KP3 have upstream CpG islands and
some degree of expression evidence in CHM13
(Fig. 5D and table S17). Any variants within
these paralogs, and others like them, will be 
overlooked when using GRCh38 as a reference. 


Future of the human reference genome 


The T2T-CHM13 assembly adds five full chro- 
mosome arms and more additional sequence 
than any genome reference release in the past 
20 years (Fig. 1D). This 8% of the genome has 
not been overlooked because of a lack of im- 
portance but rather because of technological 
limitations. High-accuracy long-read sequenc- 
ing has finally removed this technological 
barrier, enabling comprehensive studies of 
genomic variation across the entire human 
genome, which we expect to drive future dis- 
covery in human genomic health and disease. 
Such studies will necessarily require a com- 
plete and accurate human reference genome. 
CHM13 lacks a Y chromosome, and homo- 
zygous Y-bearing CHMs are nonviable, so a dif- 
ferent sample type will be required to complete 
this last remaining chromosome. However, 
given its haploid nature, it should be possible 
to assemble the Y chromosome from a male 
sample using the same methods described here 
and supplement the T2T-CHM13 reference as- 
sembly with a Y chromosome as needed. 
Extending beyond the human reference ge- 
nome, large-scale resequencing projects have 
revealed genomic variation across human populations.
Our reanalyses of the 1KGP (25) and
SGDP (42) datasets have already shown the 
advantages of T2T-CHM13, even for short-read 
analyses. However, these studies give only a 
glimpse of the extensive structural variation that 
lies within the most repetitive regions of the ge- 
nome assembled here. Long-read resequencing 
studies are now needed to comprehensively 
survey polymorphic variation and reveal any 
phenotypic associations within these regions. 

Although CHM13 represents a complete 
human haplotype, it does not capture the 
full diversity of human genetic variation. To 
address this bias, the Human Pangenome 
Reference Consortium (59) has joined with 
the T2T Consortium to build a collection of 
high-quality reference haplotypes from a 
diverse set of samples. Ideally, all genomes 
could be assembled at the quality achieved 
here, but automated T2T assembly of diploid 
genomes presents a difficult challenge that 
will require continued development. Until this 
goal is realized, and any human genome can 
be completely sequenced without error, the 
T2T-CHM13 assembly represents a more com- 
plete, representative, and accurate reference 
than GRCh38.

INTRODUCTION: One of the central applications
of the human reference genome has been to 
serve as a baseline for comparison in nearly all 
human genomic studies. Unfortunately, many 
difficult regions of the reference genome have 
remained unresolved for decades and are af- 
fected by collapsed duplications, missing sequen- 
ces, and other issues. Relative to the current 
human reference genome, GRCh38, the
Telomere-to-Telomere CHM13 (T2T-CHM13) genome
closes all remaining gaps, adds nearly 200 mil- 
lion base pairs (Mbp) of sequence, corrects 
thousands of structural errors, and unlocks the 
most complex regions of the human genome for 
scientific inquiry. 


RATIONALE: We demonstrate how the T2T- 
CHM13 reference genome universally improves 
read mapping and variant identification in a 
globally diverse cohort. This cohort includes all 
3202 samples from the expanded 1000 Genomes
Project (1KGP), sequenced with short reads, as
well as 17 globally diverse samples sequenced 
with long reads. By applying state-of-the-art 
methods for calling single-nucleotide variants 
(SNVs) and structural variants (SVs), we docu- 
ment the strengths and limitations of T2T-CHM13 
relative to its predecessors and highlight its pro- 
mise for revealing new biological insights within 
technically challenging regions of the genome. 


RESULTS: Across the 1KGP samples, we found 
more than 1 million additional high-quality 
variants genome-wide using T2T-CHM13 than 
with GRCh38. Within previously unresolved 
regions of the genome, we identified hundreds 
of thousands of variants per sample—a pro- 
mising opportunity for evolutionary and bio- 
medical discovery. T2T-CHM13 improves the 
Mendelian concordance rate among trios and 
eliminates tens of thousands of spurious SNVs 
per sample, including a reduction of false
positives in 269 challenging, medically relevant
genes by up to a factor of 12. These corrections
are in large part due to improvements to 70
protein-coding genes in >9 Mbp of inaccurate
sequence caused by falsely collapsed or duplicated
regions in GRCh38. Using the T2T-CHM13
genome also yields a more comprehensive view
of SVs genome-wide, with a greatly improved
balance of insertions and deletions. Finally, by
providing numerous resources for T2T-CHM13
(including 1KGP genotypes, accessibility masks,
and prominent annotation databases), our
work will facilitate the transition to T2T-CHM13
from the current reference genome.

Figure: Comparison of GRCh38 and T2T-CHM13 (chr1)


CONCLUSION: The vast improvements in vari- 
ant discovery across samples of diverse ances- 
tries position T2T-CHM13 to succeed as the 
next prevailing reference for human genetics. 
T2T-CHM13 thus offers a model for the con- 
struction and study of high-quality reference 
genomes from globally diverse individuals, such 
as is now being pursued through collaboration 
with the Human Pangenome Reference Consor- 
tium. As a foundation, our work underscores the 
benefits of an accurate and complete reference 
genome for revealing diversity across human 
populations.

Compared to its predecessors, the Telomere-to-Telomere CHM13 genome adds nearly 200 million base pairs of
sequence, corrects thousands of structural errors, and unlocks the most complex regions of the human 
genome for clinical and functional study. We show how this reference universally improves read mapping and 
variant calling for 3202 and 17 globally diverse samples sequenced with short and long reads, respectively. We 
identify hundreds of thousands of variants per sample in previously unresolved regions, showcasing the 
promise of the T2T-CHM13 reference for evolutionary and biomedical discovery. Simultaneously, this reference 
eliminates tens of thousands of spurious variants per sample, including reduction of false positives in 269 
medically relevant genes by up to a factor of 12. Because of these improvements in variant discovery coupled with 
population and functional genomic resources, T2T-CHM13 is positioned to replace GRCh38 as the prevailing
reference for human genetics.


For the past 20 years, the human reference
genome (GRCh38) has served as the bedrock
of human genetics and genomics
(1-3). One of the central applications of
the human reference genome, and of ref- 
erence genomes in general, has been to serve 
as a substrate for clinical, comparative, and 
population genomic analyses. More than 1 mil- 
lion human genomes have been sequenced to 
study genetic diversity and clinical relation- 
ships, and nearly all of them have been analyzed
by aligning the sequencing reads from the
donors to the reference genome [e.g., (4-6)]. 
Even when donor genomes are assembled de 
novo, independent of any reference, the as- 
sembled sequences are almost always com- 
pared to a reference genome to characterize 
variation by leveraging deep catalogs of avail- 
able annotations (7, 8). Consequently, human 
genetics and genomics benefit from the availa- 
bility of a high-quality reference genome, ideally 
without gaps or errors that may obscure im- 
portant variation and regulatory relationships. 

The current human reference genome, 
GRCh38, is used for countless applications, 
with rich resources available to visualize and 
annotate the sequence across cell types and 
disease states (3, 9-12). However, despite deca- 
des of effort to construct and refine its sequence, 
the human reference genome still suffers from 
several major limitations that hinder compre- 
hensive analysis. Most immediately, GRCh38 
contains more than 100 million nucleotides 
that either remain entirely unresolved (cur- 
rently represented as “N” characters), such as 
the p-arms of the acrocentric chromosomes, 
or are substituted with artificial models, such 
as the centromeric satellite arrays (13). Further- 
more, GRCh38 possesses 11.5 Mbp of unplaced 
and unlocalized sequences that are represented 
separately from the primary chromosomes 
(3, 14). These sequences are difficult to study, 
and many genomic analyses exclude them to 
avoid identifying false variants and false reg- 
ulatory relationships (6). Relatedly, artifacts 
such as an apparent imbalance between insertions
and deletions (indels) have been attributed
to systematic misassemblies in GRCh38 (15-17). 
Overall, these errors and omissions in GRCh38 
introduce biases in genomic analyses, particu- 
larly in centromeres, satellites, and other com- 
plex regions. 

Another major concern regards the influ- 
ence of the reference genome on analysis of 
variation across large cohorts for population 
and clinical genomics. Several studies, such 
as the 1000 Genomes Project (1KGP) (18) and
gnomAD (6), have provided information about 
the extent of genetic diversity within and 
between human populations. Many analyses 
of Mendelian and complex diseases use these 
catalogs of single-nucleotide variants (SNVs), 
small indels, and structural variants (SVs) to 
rank and prioritize potential causal variants 
on the basis of allele frequencies (AFs) and 
other evidence (19-21). When evaluating these
resources, the overall quality and representa- 
tiveness of the human reference genome 
should be considered. Any gaps or errors in 
the sequence could obscure variation and its 
contribution to human phenotypes and disease. 

In addition to omissions such as centro- 
meric sequences or acrocentric chromosome 
arms, the current reference genome possesses 
other errors and biases, including within genes 
of known medical relevance (22, 23). Moreover, 
GRCh38 was assembled from multiple donors 
with clone-based sequencing, which creates an 
excess of artificial haplotype structures that can 
subtly bias analyses (1, 24). Over the years, there
have been attempts to replace certain rare alleles 
with more common alleles, but hundreds of 
thousands of artificial haplotypes and rare 
alleles remain to this day (3, 25, 26). Increasing 
the continuity, quality, and representativeness 
of the reference genome is therefore crucial 
for improving genetic diagnosis, as well as 
for understanding the complex relationship 
between genetic and phenotypic variation. 

The Telomere-to-Telomere (T2T) CHM13 
genome addresses many of the limitations of 
the current reference (27). Specifically, the 
T2T-CHM13v1.0 assembly adds nearly 200 Mbp 
of sequence and resolves errors present in 
GRCh38. Here, we demonstrate the impact of 
the T2T-CHM13 reference on variant discovery 
and genotyping in a globally diverse cohort. This 
includes all 3202 samples from the recently 
expanded 1KGP sequenced with short reads (28)
along with 17 samples from diverse popula- 
tions sequenced with long reads (8, 27, 29). 
Our analysis reveals more than 2 million variants 
within previously unresolved regions of the 
genome, genome-wide improvements in SV 
discovery, and enhancement in variant calling 
accuracy across 622 medically relevant genes. 
In summary, our work demonstrates universal 
improvements in read mapping and variant 
calling, thereby broadening the horizon for 
future genomic studies.

Structural comparisons of GRCh38 and
T2T-CHM13 

Introducing the T2T-CHM13 genome 

The T2T-CHM13 reference genome was primar- 
ily assembled from Pacific Biosciences (PacBio) 
High Fidelity (HiFi) reads augmented with 
Oxford Nanopore Technology (ONT) reads to 
close gaps and resolve complex repeats (27). 
The resulting T2T-CHM13v1.0 assembly was 
subsequently validated and polished, with a 
consensus accuracy estimated to be between 
Phred Q67 and Q73 (27, 30) and with only three 
minor known structural defects detected (30). 
The assembly is highly contiguous, with only 
five unresolved regions from the most highly 
repetitive ribosomal DNA (rDNA) arrays, re- 
presenting only 9.9 Mbp of sequence out of 
>3.0 Gbp of fully resolved sequence. The version
1.0 assembly adds or revises 229 Mbp of
sequence compared to GRCh38; these are “non- 
syntenic” regions of the T2T-CHM13 assembly 
that do not linearly align to GRCh38 over a 1-Mbp 
interval. Furthermore, 189 Mbp of sequence are 
not covered by any primary alignments from 
GRCh38 and are resolved in the T2T-CHM13 
assembly. Figure 1A is a summary diagram of the 
syntenic/nonsyntenic regions and their associated 
annotations for chromosomes 1 and 21 (figs. S1 to 
S4 give details for all chromosomes). Note that 
the subsequent T2T-CHM13v1.1 assembly (27) 
further resolves the rDNA regions using model 
sequences for some array elements, although 
for this study we analyze the v1.0 assembly, 
which does not contain these representations. 
The bulk of the nonsyntenic sequence within 
T2T-CHM13 comprises centromeric satellites 
(190 Mbp) (31) and copies of segmental duplications
(SDs; 218 Mbp) (32). These sequences
could prove challenging for variant analysis, 
especially for variants identified using short- 
read sequencing. However, relative to GRCh38, 
we report an overall increase in unique se- 
quence, defined as k-length strings (k-mers) 
found only once in the genome (e.g., 14.9 Mbp 
of added unique sequence when considering 
50-mers, 23.5 Mbp for 100-mers, and 39.5 Mbp 
for 300-mers). These sequences delineate re- 
gions of confident mapping for short paired- 
end reads or longer reads, including those in 
previously unrepresented portions of the ge- 
nome (Fig. 1B and figs. S5 and S6). 
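
As a minimal illustration of the unique k-mer definition above, the following Python sketch counts k-mers that occur exactly once in a toy sequence. The function name and the toy sequence are hypothetical, and the sketch considers only the forward strand; it is not the procedure used to compute the values reported here.

```python
from collections import Counter


def unique_kmer_count(sequence: str, k: int) -> int:
    """Count k-mers that occur exactly once in `sequence` (forward strand only)."""
    if k <= 0:
        raise ValueError("k must be positive")
    # Tally every overlapping window of length k.
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    return sum(1 for n in counts.values() if n == 1)


# Hypothetical toy sequence containing a short repeat ("ACGT" occurs twice).
toy_sequence = "ACGTACGTTTGA"
for k in (2, 4, 6):
    print(k, unique_kmer_count(toy_sequence, k))
```

In this toy case, longer k-mers span the repeat and become unique, which mirrors why the amount of unique sequence grows with k in the comparison above.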

More than 106 Mbp of sequence absent from
GRCh38 was identified in T2T-CHM13 within
highly repetitive regions that require reads of
more than 300 bp to uniquely map. Concomitantly,
T2T-CHM13 possesses fewer exactly
duplicated sequences (≥5 kbp) shared across
chromosomes (excluding sequence pairs within
centromeres) than GRCh38 (figs. S7 and S8).
Specifically, GRCh38 possesses 28 large shared
interchromosomal sequences, primarily consisting
of pairs of subtelomeric sequences, with an
additional 42 pairs involving at least one unplaced
contig. All of these identical sequence
pairs, except for one between two subtelomeres,
are nonidentical in T2T-CHM13, as small but
important differences between repetitive elements
have now been resolved (27, 33).

Fig. 1. Genomic comparisons of human assemblies GRCh38 and T2T-CHM13.
(A) Overview of annotations available for GRCh38 and T2T-CHM13
chromosomes 1 and 21 with colors indicated in legends, which are also used
in (B) to (D). Colors for mean minimum (min) unique k-mers are defined in
the legend with indicated asterisk. Cytobands are pictured as gray bands with
red bands representing centromeric regions within ideograms. Complete
annotations of all chromosomes can be found in figs. S1 to S4. Local ancestry
is denoted using 1KGP superpopulation abbreviations (AFR, African; AMR,
admixed American; EAS, East Asian; EUR, European; SAS, South Asian).
(B) Summary of the number of bases and/or genes annotated by different
features for the assemblies with colors indicated in the legends shown in
(A). Note, dbSNP liftover failures (pink) are not annotated in (A). (C) Example
of a clone boundary (red line) where GRCh38 possesses a combination of
alleles that segregate in negative LD within the 1KGP sample (which we term
an “LD-discordant haplotype”). SNPs are depicted in columns; phased
1KGP samples are depicted in rows. White indicates reference allele genotype;
black indicates alternative allele genotypes. Superpopulation ancestry of each
sample is indicated in the rightmost column with colors indicated in local
ancestry legend shown in (A). CEP104 splice isoforms (blue) are depicted
at the bottom. (D) Tally of such LD-discordant haplotypes in a selection of
1KGP individuals, colored by population, as well as GRCh38 and T2T-CHM13.
(E) Examples of variants that cannot be lifted over to T2T-CHM13 because
of structural differences between the genomes. The position of the reference
allele in GRCh38 is shown in red.


T2T-CHM13 accurately represents the haplotype
structure of human genomes 


The human reference genome serves as the 
standard to which other genomes are com- 
pared, and is typically perceived as a haploid 
representation of an arbitrary genome from 
the population (25). In contrast with T2T-CHM13, 
which derives from a single homozygous com- 
plete hydatidiform mole, the Human Genome 
Project constructed the current reference genome 
via the tiling of sequences obtained from bac- 
terial artificial chromosomes (BACs) and other 
clones with lengths ranging from ~50 to >250 kbp 
(24), which derived from multiple donor individ- 
uals. GRCh38 and its predecessors thus comprise 
mosaics of many haplotypes, albeit with a single 
library (RP11) contributing the majority (24). 
To further characterize this aspect of GRCh38 
and its implications for population studies, we 
performed local ancestry inference for both 
GRCh38 and T2T-CHM13 through comparison 
to haplotypes from the 1KGP (34) (Fig. 1A and
figs. S2 and S9). Continental superpopulation- 
level ancestry was inferred for 72.9% of GRCh38 
clones on the basis of majority votes of nearest- 
neighbor haplotypes. For the remaining 27.1% 
of clones, no single superpopulation achieved 
a majority of nearest neighbors, and ancestry 
thus remained ambiguous. This ambiguity oc- 
curred primarily for short clones with few in- 
formative SNPs (fig. S10), but also for some 
longer clones with potential admixed ancestry. 
In accordance with Green et al. (24), we 
inferred that library RP11, which constitutes 
72.6% of the genome, is derived from an in- 
dividual of admixed African-American ances- 
try, with 56.0% and 28.1% of its component 
clones assigned to African and European local 
ancestries, respectively. The second most abun- 
dant library, CTD (5.5% of the genome), con- 
sists of clones of predominantly (86.3%) East 
Asian local ancestries, and the remaining 
libraries are derived from individuals of pre- 
dominantly European ancestries. In contrast, 
T2T-CHM13 exhibits European ancestries near- 
ly genome-wide (fig. S11). In addition, GRCh38 
and T2T-CHM13 harbor 26.7 Mbp and 51.0 Mbp, 
respectively, of putative Neanderthal-introgressed 
sequences that originated from ancient interbreeding
between the two hominin groups
~60,000 years ago (24). The excess of intro- 
gressed sequence in T2T-CHM13, even when 
restricting to the genomic intervals of GRCh38 
clones with confident ancestry assignments, 
is consistent with its greater proportion of 
non-African ancestry. 
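
The majority-vote step of the local ancestry inference described above can be sketched as follows. This is a hedged illustration only: the function name is hypothetical, "majority" is assumed to mean strictly more than half of the nearest-neighbor haplotypes, and the selection of nearest neighbors itself is not shown.

```python
from collections import Counter
from typing import Optional, Sequence


def majority_ancestry(neighbor_labels: Sequence[str]) -> Optional[str]:
    """Return the superpopulation held by a strict majority of neighbors.

    Returns None (ambiguous ancestry) when no label exceeds 50% of the votes
    or when there are no neighbors at all.
    """
    if len(neighbor_labels) == 0:
        return None
    label, votes = Counter(neighbor_labels).most_common(1)[0]
    return label if votes * 2 > len(neighbor_labels) else None


# Hypothetical nearest-neighbor labels for two clones.
print(majority_ancestry(["AFR", "AFR", "EUR"]))         # strict majority for AFR
print(majority_ancestry(["AFR", "EUR", "EAS", "AFR"]))  # no majority, ambiguous
```

Under this assumption, a clone whose neighbors split evenly between superpopulations is reported as ambiguous rather than assigned to the most frequent label.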

We hypothesized that the mosaic nature of 
GRCh38 would generate abnormal haplotype 
structures at the boundaries of clones used 
for its construction, producing combinations 
of alleles that are rare or absent from the 
human population. Indeed, some previous 
patches of the reference genome sought to 
correct abnormal haplotype structures wher- 
ever noticed because of their impacts on genes 
of clinical importance (e.g., ABO and SLC39A4) 
(3). Such artificial haplotypes would mimic rare 
recombinant haplotypes private to any given 
sample, but at an abundance and genomic 
scale unrepresentative of any living human. 
To test this hypothesis, we identified pairs of 
common [minor allele frequency (MAF) > 10%] 
SNP alleles always observed on the same 
haplotype [i.e., segregate in perfect (R² = 1)
linkage disequilibrium (LD)] in the 2504 
unrelated individuals of the 1KGP and queried 
the allelic states of these SNPs in both GRCh38 
and T2T-CHM13 (34). 
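
The test described above can be illustrated with a small Python sketch. It assumes biallelic SNPs coded 0/1 on phased haplotypes, uses the fact that two polymorphic SNPs have R² = 1 exactly when only two complementary allele combinations are observed, and flags a reference as LD-discordant when it carries a combination never seen in the panel. All function names and the toy data are hypothetical.

```python
from typing import Sequence, Tuple


def minor_allele_frequency(alleles: Sequence[int]) -> float:
    """Minor allele frequency of a 0/1-coded SNP across haplotypes."""
    freq = sum(alleles) / len(alleles)
    return min(freq, 1.0 - freq)


def in_perfect_ld(a: Sequence[int], b: Sequence[int]) -> bool:
    """True when only two complementary allele combinations occur (R^2 = 1)."""
    combos = set(zip(a, b))
    return combos == {(0, 0), (1, 1)} or combos == {(0, 1), (1, 0)}


def is_ld_discordant(a: Sequence[int], b: Sequence[int],
                     reference_pair: Tuple[int, int],
                     min_maf: float = 0.10) -> bool:
    """True when both SNPs are common, in perfect LD in the panel, and the
    reference carries an allele combination absent from the panel."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("SNP vectors must be non-empty and of equal length")
    common = (minor_allele_frequency(a) > min_maf
              and minor_allele_frequency(b) > min_maf)
    return (common and in_perfect_ld(a, b)
            and tuple(reference_pair) not in set(zip(a, b)))


# Hypothetical panel of eight phased haplotypes at two SNPs.
snp_a = [0, 0, 0, 1, 1, 1, 0, 1]
snp_b = [0, 0, 0, 1, 1, 1, 0, 1]
print(is_ld_discordant(snp_a, snp_b, (0, 1)))  # combination never observed
print(is_ld_discordant(snp_a, snp_b, (0, 0)))  # combination observed in panel
```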

In accordance with our expectations, we 
identified numerous haplotype transitions in 
GRCh38 absent from the 1KGP samples, with
18,813 pairs of LD-discordant SNP alleles (i.e., 
in perfect negative LD) distributed in 1390 nar- 
row nonoverlapping clusters (median length = 
3703 bp) throughout the genome (Fig. 1C). Such 
rare haplotype transitions are comparatively 
scarce in T2T-CHM13, with only 209 pairs of 
common high-LD SNPs (50 nonoverlapping 
clusters) possessing allelic combinations absent 
from the 1KGP sample (Fig. 1D). Using a leave- 
one-out analysis, we confirmed that T2T-CHM13 
possesses a similar number of LD-discordant 
haplotypes as phased “haploid” samples from 
1KGP, whereas GRCh38 vastly exceeds this 
range (Fig. 1E). By intersecting the GRCh38 
results with the tiling path of BAC clones, we 
found that 88.9% (16,733 of 18,813) of dis- 
cordant SNP pairs straddle the documented 
boundaries of adjacent clones (fig. S12). Of
these, 45.9% (7686 of 16,733) of the clone 
pairs derived from different BAC libraries, 
whereas the remainder likely largely reflects 
random sampling of distinct homologous 
chromosomes from the same donor individual. 
Thus, our analysis suggests that T2T-CHM13 
accurately reflects haplotype patterns observed 
in contemporary human populations, whereas 
GRCh38 does not. 


T2T-CHM13 corrects genomic collapsed
duplications and falsely duplicated regions 


Genome assemblies often suffer from errors in 
complex genomic regions such as SDs. In the
case of GRCh38, targeted sequencing of BAC
clones has been performed to fix many such 
loci (3, 16, 35-38), but problems persist. To sys- 
tematically identify errors in GRCh38 that 
could produce spurious variant calls, we lever- 
aged the fact that T2T-CHM13 is an effectively 
haploid cell line that should produce only ho- 
mozygous variants when its sequence is aligned 
to GRCh38. Thus, any apparent heterozygous 
variant can be attributed to mutations accrued 
in the cell line, sequencing errors, or read map- 
ping errors. In the last case, assembly errors or 
copy number polymorphism of SDs produce 
contiguous stretches of heterozygous variants 
(39), which confound the accurate detection of 
paralog-specific variants (PSVs). By mapping 
PacBio HiFi reads from the CHM13 cell line 
(27) as well as Illumina-like simulated reads 
(150 bp) obtained from the T2T-CHM13 ref- 
erence to GRCh38, we identified 368,574 
heterozygous SNVs within the autosomes and 
chromosome X, of which 56,413 (15.3%) were 
shared between datasets. This evidence shows 
that each technology is distinctively informa- 
tive as a result of differences in mappability 
(fig. S13 and table S1).

To home in on variants deriving from col- 
lapsed duplications, we delineated “clusters” 
of heterozygous calls (34) and identified 908 
putative problematic regions (541 supported 
by both technologies) comprising 20.8 Mbp 
(Fig. 1 and fig. S13). Many of these loci intersected
SD-associated regions (668/908; 73.6%)
and centromere-associated regions (542/908; 
59.7%) (31) as well as known GRCh38 issues 
(341/908; 37.55%). Variants flagged as excessively 
heterozygous in the population by gnomAD 
(6) were significantly enriched in these regions 
(10,000 permutations, empirical P = 1 × 10⁻⁴),
representing 23.6% (87,005/368,574) of our dis- 
covered CHM13 heterozygous variants, which 
suggests that these spurious variants arise in 
genome screens and represent false positives 
(Fig. 1A and figs. S1 and S3). 
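
One simple way to delineate clusters of heterozygous calls is to group calls separated by no more than a fixed gap and keep groups with enough calls. The Python sketch below illustrates that idea only; the gap and minimum-call thresholds are hypothetical, and the criteria actually used in this study are described in its supplementary methods (34) and are not reproduced here.

```python
from typing import Iterable, List, Tuple


def cluster_positions(positions: Iterable[int], max_gap: int,
                      min_calls: int) -> List[Tuple[int, int, int]]:
    """Group variant positions on one chromosome into clusters.

    Consecutive calls no more than `max_gap` bp apart share a cluster.
    Returns (start, end, number_of_calls) for clusters with >= `min_calls`.
    """
    clusters: List[Tuple[int, int, int]] = []
    run: List[int] = []
    for pos in sorted(positions):
        # Close the current run when the next call is too far away.
        if run and pos - run[-1] > max_gap:
            if len(run) >= min_calls:
                clusters.append((run[0], run[-1], len(run)))
            run = []
        run.append(pos)
    if len(run) >= min_calls:
        clusters.append((run[0], run[-1], len(run)))
    return clusters


# Hypothetical heterozygous call positions (bp) on one chromosome.
calls = [100, 150, 180, 5000, 9000, 9050, 9100, 9120]
print(cluster_positions(calls, max_gap=500, min_calls=3))
```

With these toy thresholds the isolated call at 5000 bp is discarded, while the two dense runs are reported as candidate problematic regions.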

We next “lifted over” (i.e., converted the 
coordinates of) 821 of these 908 putative 
problematic regions to the T2T-CHM13 assembly 
and used human copy number estimates [n = 
268 individuals from the Simons Genome 
Diversity Project (SGDP)] (32, 40) to conserv- 
atively identify 203 loci (8.04 Mbp) evidencing 
missing copies in GRCh38 (fig. S14). These 
regions have an impact on 308 gene features, 
with 14 of the total 48 protein-coding genes 
fully contained within a problematic region, 
indicating that complete gene homologs are 
hidden from GRCh38-based population analy- 
ses of variation. Examples include DUSP22, a 
gene involved in immune regulation (41), as 
well as KMT2C, a gene implicated in Kleefstra 
syndrome 2 (42) (fig. S15). Additionally, we 
identified 30 SNPs within problematic regions 
with known phenotype associations from the 
GWAS Catalog (43). Finally, we evaluated the
status of these regions in the T2T-CHM13
reference by following a similar approach to 
obtain 9193 heterozygous variants clustered in 
11 regions—none of which overlapped GRCh38 
problematic regions (table S2). As a result,
we are now able to call variants in these 48 
previously inaccessible protein-coding genes. 
We did identify one putative collapsed dupli- 
cation in T2T-CHM13, based on the presence 
of a heterozygous variant cluster and reduced 
copy number in T2T-CHM13, localized to an 
rDNA array corrected in the most recent version 
of T2T-CHM13v1.1 (27).

Conversely, the T2T-CHM13 reference also 
corrects regions falsely portrayed as dupli- 
cated in GRCh38. Specifically, we identified 
12 regions affecting 1.2 Mbp and 74 genes 
(including 22 protein-coding genes) with dup- 
lications private to GRCh38 and not found in 
T2T-CHM13 or the 268 genomes from SGDP 
(40) (fig. S14 and table S3). In contrast, only 
five regions affecting 160 kbp have duplica- 
tions in T2T-CHM13 that are not in GRCh38 
or the SGDP, which suggests that genuine rare 
variation cannot explain the excess of private 
duplications in GRCh38. Indeed, upon inspect- 
ing the CHM13 data, we deemed that these five 
loci are true duplications with support from 
mapped HiFi reads (30). 

The five largest duplications in GRCh38, 
affecting 15 protein-coding genes on the q-arm 
of chromosome 21, involve BAC clones with se- 
quence misplaced between gaps on the hetero- 
chromatic p-arm of the same chromosome. 
The Genome Reference Consortium (GRC)—an 
international team of researchers that has main- 
tained and improved the reference genome and 
related resources since its initial publication— 
determined that admixture mapping incorrect- 
ly localized these five clones to the acrocentric 
short arm and therefore should not have been 
added to GRCh38 (34). Of the seven false dup- 
lications outside chromosome 21, two occur in 
short contigs between gaps, two occur adja- 
cent to a gap, two occur on unlocalized “ran- 
dom” contigs, and one occurs as a tandem 
duplication (table S4). We provide an exhaus- 
tive list of falsely duplicated gene pairs corrected 
in T2T-CHM13 (table S5). Thus, T2T-CHM13 
authoritatively corrects many false duplica- 
tions, improving variant calling for short- and 
long-read technologies, including in medically 
relevant genes. 


Liftover of clinically relevant and trait-associated 
variation from GRCh38 to T2T-CHM13


In transitioning to a different reference genome, 
it is imperative to document the locations of 
known genetic variation of biological and 
clinical relevance with respect to the updated 
coordinate system. To this end, we sought to lift 
over 802,674 unique variants in the ClinVar 
database and 736,178,420 variants from the 
NCBI dbSNP database (including 151,876
NHGRI-EBI GWAS Catalog variants) from
the GRCh38 reference to the T2T-CHM13 
reference. Liftover was successful for 800,942 
(99.8%) ClinVar variants, 723,117,125 (98.2%) 
NCBI dbSNP variants, and 150,962 (99.4%) 
GWAS Catalog SNPs (table S6). We provide 
these lifted-over datasets as a resource for the 
scientific community within the UCSC Genome 
Browser and the NHGRI AnVIL, along with lists 
of all variants that failed liftover and the as- 
sociated reasons (Fig. 1, A and B, and figs. S1 and 
S4). Critically, this resource includes 138,319 of 
138,927 (99.6%) ClinVar variants annotated as 
“pathogenic” or “likely pathogenic.” 

Of the 1732 ClinVar variants that failed to 
lift over, 1186 overlap documented insertions 
or deletions that distinguish the GRCh38 and 
T2T-CHM13 assemblies. The remaining 546 
variants (<0.1% of all variants) lie within regions 
of poor alignment between the GRCh38 and 
T2T-CHM13 assemblies (Fig. 1E). The modes 
of liftover failure for variants in dbSNP and 
the GWAS Catalog follow similar distributions 
(table S6). In all, these annotated variants offer 
a resource to enable researchers to interpret 
genetic results using the T2T-CHM13 assembly. 
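
The liftover rates quoted in this section follow directly from the reported counts. The short Python sketch below recomputes them as a check on the arithmetic. The counts are taken from the text; the dictionary layout and function names are illustrative assumptions and are not part of the study's pipeline.

```python
# Counts quoted in the text as (lifted, total); the layout is illustrative.
LIFTOVER_COUNTS = {
    "ClinVar": (800_942, 802_674),
    "dbSNP": (723_117_125, 736_178_420),
    "GWAS Catalog": (150_962, 151_876),
    "ClinVar pathogenic/likely pathogenic": (138_319, 138_927),
}


def success_rate(lifted: int, total: int) -> float:
    """Percentage of variants that lifted over, rounded to one decimal place."""
    return round(100.0 * lifted / total, 1)


def liftover_summary(counts):
    """Map each dataset name to its liftover success rate in percent."""
    return {name: success_rate(lifted, total) for name, (lifted, total) in counts.items()}


# ClinVar failures, split by the two failure modes described in the text.
CLINVAR_FAILED = 802_674 - 800_942           # variants that did not lift over
INDEL_OVERLAP = 1186                         # failures overlapping assembly indels
POOR_ALIGNMENT = CLINVAR_FAILED - INDEL_OVERLAP
```

Calling `liftover_summary(LIFTOVER_COUNTS)` reproduces the percentages given above, and the remaining poorly aligned ClinVar variants amount to less than 0.1% of all ClinVar variants.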


T2T-CHM13 improves analysis of global 
genetic diversity based on 3202 short-read 
samples from the 1KGP dataset

T2T-CHM13 improves short-read mapping
across populations 


To investigate how the T2T-CHM13 assembly 
affects short-read variant calling, we realigned 
and reprocessed all 3202 samples from the 
1KGP cohort (28) using the NHGRI AnVIL 
Platform (44) (figs. S16 and S17). In this col- 
lection, each sample is sequenced to at least 30x 
coverage with paired-end Illumina sequencing, 
with samples from 26 diverse populations across 
five major continental superpopulations (fig. 
S18). Although most samples are unrelated, the
expanded collection includes 602 complete trios 
that we used to estimate the rate of false variants 
below on the basis of discordance with Mende- 
lian expectations. We matched the analysis pipe- 
line for GRCh38 (28) as closely as possible so 
that any major differences would be attribut- 
able to the reference genome rather than tech- 
nical differences in the analysis software (34). 

On average, BWA-MEM (45) maps an addi-
tional 7.4 × 10^6 (0.97%) of properly paired reads
to T2T-CHM13 compared to GRCh38, even when 
considering the alternative (ALT) and decoy 
sequences used in the original analysis (fig. 
S19). Interestingly, even though more reads 
align to T2T-CHM13, the subsequent per-read 
mismatch rate is 20 to 25% lower across all 
continental populations. African samples con- 
tinue to present the highest mismatch rate 
(Fig. 2A), as the observed mismatch rate in- 
cludes both genuine sequencing errors, which 
are largely consistent across all samples, and 
any true biological differences between the
read and the reference genome, which vary
substantially according to the ancestry of the 
sample. Relatedly, T2T-CHM13 improved other 
mapping characteristics, including reducing the 
number of misoriented read pairs (Fig. 2A). 
Finally, by considering the alignment cover- 
age across 500-bp bins across the respective 
genomes, we observed improvement in cover- 
age uniformity within every sample’s genome 
when using T2T-CHM13 rather than GRCh38. 
For example, within gene regions, we noted a 
factor of 4 decrease in the standard deviation 
of the coverage (Fig. 2A) and similar improve- 
ments in other types of genomic regions among 
all population groups (fig. S20). Overall, these 
improvements in error rates, mapping charac- 
teristics, and coverage uniformity demonstrate 
the superiority of T2T-CHM13 as a reference 
genome for short-read alignment across all 
populations. 
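
The coverage-uniformity measure described above (mean coverage in 500-bp bins, followed by the standard deviation across bins) can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: the function names are hypothetical, the input is assumed to be a list of per-base depths for one sample, and the choice of the population standard deviation is an assumption.

```python
from statistics import mean, pstdev


def binned_coverage(depths, bin_size=500):
    """Mean depth in consecutive bins of `bin_size` bases.

    A trailing partial bin is dropped so that every bin has the same width.
    """
    n_bins = len(depths) // bin_size
    return [mean(depths[i * bin_size:(i + 1) * bin_size]) for i in range(n_bins)]


def coverage_sd(depths, bin_size=500):
    """Standard deviation of the per-bin mean coverage (population form)."""
    return pstdev(binned_coverage(depths, bin_size))
```

Under these assumptions, a perfectly uniform sample yields a value of 0, and a lower value for the same reads aligned to a different reference indicates more even coverage.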


T2T-CHM13 improves variant calling
across populations 


From these alignments, we next generated 
SNV and small indel variant calls with the 
GATK Haplotype Caller, which uses a joint 
genotyping approach to optimize accuracy 
across large populations (46). Again, we 
matched the pipeline used in the prior 1KGP
study, albeit with updated versions of some 
analysis tools, to minimize software discrep- 
ancies and attribute differences to changes 
in the reference genome. Across all samples, 
we identified 126,591,489 high-quality (“PASS”) 
variants relative to T2T-CHM13 (per-sample 
mean, 4,717,525; median, 4,419,802) compared 
to 125,484,020 variants relative to GRCh38 (per- 
sample mean, 5,101,897; median, 4,867,871), 
additionally noting a decrease in the number 
of called variants per individual genome (Fig. 
2B and fig. S21). We performed all subsequent 
analyses using these high-quality variants, as 
the PASS filter successfully removed spurious 
variants (fig. S22), particularly in complex re-
gions (fig. S23).

As with the improvement to the per-read
mismatch rate, we attribute the reduction
in the number of per-sample variant calls to
the smaller number of rare alleles,
consensus errors, and structural errors in
T2T-CHM13. This conclusion is supported
by the observation that the number of hetero- 
zygous variants per sample is more similar 
(Fig. 2C and fig. S24) across reference genomes 
in contrast to homozygous variants (Fig. 2D 
and fig. S24). This discrepancy is especially 
pronounced in non-African samples, which 
have on average 200,000 to 300,000 more 
homozygous variants relative to GRCh38 than 
to T2T-CHM13, likely because ~70% of the 
GRCh38 sequence comes from an individual 
with African-American ancestry, and African 
populations are enriched for rare and private 
variants (18).

Fig. 2. Improvements to short-read mapping and variant calling. (A) Summary
of alignment characteristics aligning to T2T-CHM13 instead of GRCh38. (B) Boxplot
of overall number of variants found in each person across superpopulations,
with colors indicated in Fig. 1A legend. (C) Boxplot of the number of heterozygous
variants found in each person across superpopulations. (D) Boxplot of the number
of homozygous variants found in each person across superpopulations. (E) AF
distribution of each superpopulation relative to T2T-CHM13 and GRCh38. (F) Change
in AF distribution. (G) Number of variants with AF equal to 100%, both within
protein-coding genes and without. (H) Number of variants with AF equal to 50%, both
within putative collapsed duplications and without. (I) Violin plot of the number of
low-quality variants found when aligning to GRCh38 and T2T-CHM13. (J) Violin plot of
the number of Mendelian violations found when aligning to GRCh38 and T2T-CHM13.

Further investigating this relationship, we
computed the AFs of variants from unrelated 
samples from each of the five continental 
superpopulations (Fig. 2E). Although the 
distributions were nearly equivalent over 
most of the AF spectrum, we observed substan- 
tial differences for rare alleles (AF < 0.05); 
intermediate-frequency alleles, including errors 
where nearly all individuals are heterozygous 
(AF = 0.5); and fixed or nearly fixed alleles (AF > 
0.95). The most prominent difference in AF 
distributions affected fixed or nearly fixed alleles 
in each assembly, for which all non-African 
superpopulations showed an excess of ~150,000 
variants in GRCh38, whereas the African super- 
population showed an excess of 2364 variants 
in T2T-CHM13 (Fig. 2F). This observation was
driven by a decrease in the number of com- 
pletely fixed variants (100% AF) relative to 
GRCh38 (Fig. 2G). Such variants represent 
positions where the reference genome itself 
is the only sample observed to possess the 
corresponding allele. These alleles arise either 
because of genuine private variants in one of 
the GRCh38 donors, or from errors in the ref- 
erence genome itself, and result in 100% of 
other individuals possessing two copies of the 
alternative allele. As a result, these “variants” 
will not be reported at all if the same reads are 
mapped to a different genome that does not 
have these private alleles. Interestingly, the 
number of such private “singleton” variants 
in T2T-CHM13 lies squarely within the observed 
range of singleton counts among 1KGP sam- 
ples, adjusting for the difference in ploidy (fig. 
S25). In addition to the lower rate of private
variation relative to GRCh38, T2T-CHM13 pos- 
sesses fewer ultra-rare variants, effectively re- 
ducing the number of “nearly fixed” alleles in 
population data such as the 1KGP. 
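
The origin of completely fixed "variants" can be illustrated with a toy example. In the Python sketch below, every sample carries the same base at a site; calling against a reference that holds a private allele reports a variant with AF = 1, whereas calling against a reference that carries the common base reports nothing. The function names and the restriction to a single biallelic site are simplifying assumptions made for illustration.

```python
def call_against(reference_base, sample_genotypes):
    """Recode diploid genotypes (pairs of bases) as 0 = reference, 1 = non-reference."""
    return [
        tuple(0 if base == reference_base else 1 for base in genotype)
        for genotype in sample_genotypes
    ]


def alt_allele_frequency(coded_genotypes):
    """Fraction of all sampled chromosomes that carry the non-reference allele."""
    alleles = [allele for genotype in coded_genotypes for allele in genotype]
    return sum(alleles) / len(alleles)


def is_reported(coded_genotypes):
    """A site enters the call set only if at least one sample is non-reference."""
    return any(allele for genotype in coded_genotypes for allele in genotype)


# Toy cohort: four samples, all homozygous for G at the site.
cohort = [("G", "G")] * 4
against_private_ref = call_against("T", cohort)   # reference carries a private T
against_common_ref = call_against("G", cohort)    # reference carries the common G
```

Here `alt_allele_frequency(against_private_ref)` is 1.0, while `is_reported(against_common_ref)` is `False`, mirroring the disappearance of such calls when the same reads are mapped to a reference without the private allele.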

Finally, the reduction in AF ~ 0.5 variants 
is largely explained by the corrections to col- 
lapsed SDs (table S1), as these regions are 
highly enriched for heterozygous PSVs in 
nearly all individuals. Such erroneous heterozy- 
gous variants are caused by the false pileup of 
reads from the duplicated regions to a single 
location (Fig. 2H). Collectively, the decrease in 
variants with AF = 1 and AF ~ 0.5 largely ex- 
plains the decrease in the overall number of 
variants observed per sample and across the 
entire population for T2T-CHM13. 

Informed by these results, we considered 
the feasibility of calling variants using the 
T2T-CHM13 reference and then lifting over 
the results to GRCh38 for further analyses. 
Using a liftover tool to transform a variant 
call set for a single sample into a call set with 
respect to GRCh38 requires special handling 
to account for variants for which the two 
references have different alleles. Specifically, 
if one of the reference alleles is not present in 
the sample, it will be necessary to genotype 
the site against the T2T-CHM13 reference. 


Aganezov et al., Science 376, eabl3533 (2022) 


Although this issue is less of a concern for 
large datasets such as the 1KGP, even these
large samples will contain a small number 
of variants that become invisible when switch- 
ing reference genomes (fig. S26). In addition, 
differences in variant representation, especially 
in regions of low complexity, may cause lifted 
variant sets to differ from those called against 
the target reference. 
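
The need for special handling can be seen at a single site where the two references carry different bases. The Python sketch below is a hypothetical illustration, not the behavior of any particular liftover tool: it assumes a variant-only call set (sites matching the reference are omitted) and re-expresses a sample's two alleles against a chosen reference base.

```python
from typing import Optional, Tuple


def visible_in_callset(sample_alleles: Tuple[str, str], ref: str) -> bool:
    """A variant-only call set lists a site only if a non-reference allele is present."""
    return any(allele != ref for allele in sample_alleles)


def genotype_against(sample_alleles: Tuple[str, str], ref: str) -> Optional[dict]:
    """Express the sample's alleles relative to `ref`.

    Returns None when the sample is homozygous for the reference base,
    because such a site would not appear in a variant-only call set.
    """
    alts = sorted({allele for allele in sample_alleles if allele != ref})
    if not alts:
        return None
    index = {ref: 0}
    index.update({alt: i + 1 for i, alt in enumerate(alts)})
    genotype = tuple(sorted(index[allele] for allele in sample_alleles))
    return {"ref": ref, "alt": alts, "gt": genotype}
```

For a sample that is A/A at a site where the source reference has A and the target reference has G, `visible_in_callset(("A", "A"), "A")` is `False`, so there is no record to lift, yet `genotype_against(("A", "A"), "G")` is a homozygous alternate call. Recovering it requires genotype information at the site rather than coordinate conversion alone.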


Reduction of Mendelian-discordant variants 


As further quality control for the variant calls, 
we performed a Mendelian concordance anal- 
ysis using the 602 trios represented in the 
1KGP cohort. We observed a statistically sig- 
nificant decrease in both the number of low- 
quality variants [median, 890,701 (GRCh38) 
versus 682,609 (T2T-CHM13); P = 4.943 x 
10°°°, Wilcoxon signed-rank test] (Fig. 2I)
and the number of Mendelian-discordant 
variants (i.e., variants found in children but 
not their parents, or homozygous parental 
variants not observed in their children) 
[median, 8879 (GRCh38) versus 7484 (T2T- 
CHM13); P = 7.346 x 10°°°, Wilcoxon signed- 
rank test] (Fig. 2J) when aligned to T2T-CHM13 
as compared to GRCh38. In addition to pro- 
viding an estimate of the error rate for variant 
calls in this call set, this improvement has 
broad implications for clinical genetics analy- 
ses of de novo or somatic mutations, which 
have been implicated as causes of autism 
spectrum disorders (47) and many forms of 
cancer (48). 
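
The definition of a Mendelian-discordant variant used above can be written as a simple genotype check. The following Python sketch assumes autosomal, diploid genotypes coded as pairs of allele indices (0 = reference, 1 = alternate); the function names are hypothetical, and the sketch ignores sex chromosomes and missing calls.

```python
from itertools import product


def is_mendelian_consistent(child, mother, father):
    """True if the child's genotype can be formed from one maternal and one paternal allele."""
    target = sorted(child)
    return any(sorted(pair) == target for pair in product(mother, father))


def count_discordant(trios):
    """Number of (child, mother, father) genotype triples that violate Mendelian inheritance."""
    return sum(
        1
        for child, mother, father in trios
        if not is_mendelian_consistent(child, mother, father)
    )


# Toy sites: the first two are discordant, the third is consistent.
toy_trios = [
    ((0, 1), (0, 0), (0, 0)),  # variant in child but in neither parent
    ((0, 0), (1, 1), (0, 0)),  # homozygous parental variant absent from child
    ((0, 1), (0, 1), (0, 0)),  # child inherits the alternate allele from the mother
]
```

With these toy inputs, `count_discordant(toy_trios)` returns 2, covering both discordance classes named in the text.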


T2T-CHM13 improves SV analysis of 17 diverse 
long-read samples 

T2T-CHM13 improves long-read mapping
across populations


Next, we investigated the effects of using T2T- 
CHM13 as a reference genome for alignment 
and large SV calling from both PacBio HiFi 
and ONT long reads. To this end, we aligned 
reads and called SVs in 17 samples of diverse 
ancestries from the Human Pangenome Refe- 
rence Consortium (HPRC+) (27) and the 
Genome in a Bottle Consortium (GIAB) (29), 
including two trios (table S7). All of these
samples had HiFi data available, and 14 had 
also been sequenced with ONT (Fig. 3A), with 
mean read lengths of 18.1 kbp and 21.9 kbp 
and read N50 values of 18.3 kbp and 44.9 kbp, 
respectively (fig. S27). 

In line with our short-read results, aligning 
long reads to T2T-CHM13 versus GRCh38 did 
not substantially change the number of reads 
mapped with either Winnowmap (49) or 
minimap2 (50) because most of the previously 
unresolved sequence in T2T-CHM13 represents 
additional copies of SDs or satellite repeats 
already partially represented in GRCh38 (fig. 
S28). However, aligning to T2T-CHM13 re-
duced the observed mismatch rate per mapped
read by 5% to 40% across the four combina-
tions of sequencing technologies and aligners
because GRCh38 has more rare alleles. T2T- 
CHM13 also corrects structural errors in 
GRCh38 and is a complete assembly of the 
genome, which facilitates accurate alignment, 
similar to what we observed for short reads 
(Fig. 3B). Relatedly, we found that previously 
reported African-specific (51) and Icelandic-
specific (52) sequences at least 1 kbp in length
align with substantially greater identity and
completeness to T2T-CHM13 than to GRCh38
(34) (figs. S29 and S30).

To study coverage uniformity, we next mea- 
sured the average coverage across each 500-bp 
bin on a per-sample basis and computed the 
standard deviation of the coverage. Across 
all aligners and technologies, the median stan- 
dard deviation of the per-bin coverage was re- 
duced by more than a factor of 3, indicating 
more stable mapping to T2T-CHM13 (Fig. 3B). 
This difference in coverage uniformity was 
pronounced in satellite repeats and other 
regions of GRCh38 that are nonsyntenic with 
T2T-CHM13 (figs. S31 and S32). This coverage
uniformity will broadly improve variant call- 
ing and other long read-based analyses. 


T2T-CHM13 improves the balance of apparent indels 


We next used our optimized SV-calling pipeline, 
including Sniffles (53), Iris, and Jasmine (54), to 
call SVs in all 17 samples (figs. S33 and S34) 
and consolidate them into a cohort-level call 
set in each reference from HiFi data. From 
these results, we observed a reduction from 
5147 to 2260 SVs that are homozygous in all 
17 individuals when calling variants relative 
to T2T-CHM13 instead of GRCh38 (Fig. 3C). 
Previous studies (16, 77) have noted the ex- 
cess of such SV calls when using GRCh38 as a 
reference and attributed them to structural 
errors. Here, we found that using a complete 
and accurate reference genome naturally re- 
duces the number of such variants. In addi- 
tion, the number of indels was more balanced 
when calling against T2T-CHM13, whereas 
GRCh38 exhibited a bias toward insertions 
caused by missing or incomplete sequence 
(Fig. 3D), such as incorrectly collapsed tandem 
repeats (16). With respect to T2T-CHM13, we 
observed a small bias toward deletions, which 
likely results from the challenges in calling 
insertions with mapping-based methods and 
in representing SVs within repeats, as this 
difference is especially prominent in highly 
repetitive regions such as centromeres and 
satellite repeats (fig. S35). The variants we 
observed relative to T2T-CHM13 are enriched 
in the centromeres and subtelomeric sequen- 
ces, likely because of a combination of repetitive 
sequence and greater recombination rates (7). 
We observed similar trends among SVs unique 
to single samples (fig. S36). 

Fig. 3. Improvements to long-read alignment and SV calling in CHM13. (A) The
coverage, ancestry, and sequencing platforms available for the 17 samples
sequenced with long reads (headers: AFR, African; AMR, Admixed American; ASH,
Ashkenazi; EAS, East Asian; SAS, South Asian; populations: ACB, African Caribbean
in Barbados; ASH, Ashkenazi; CHS, Southern Han Chinese; CLM, Colombian in
Medellin, Colombia; GWD, Gambian in Western Division, The Gambia; KHV, Kinh in
Ho Chi Minh City, Vietnam; MSL, Mende in Sierra Leone; PJL, Punjabi in Lahore,
Pakistan; PUR, Puerto Rican in Puerto Rico). (B) The genome-wide mapping error
rate and the standard deviation of the coverage for T2T-CHM13 (orange) and
GRCh38 (blue). The standard deviation was computed across each 500-bp
bin of the genome. (C) The allele frequency of SVs derived from HiFi data in
T2T-CHM13 and GRCh38 among the 17-sample cohort. The red arrows indicate
fixed (100% frequency) variants. (D) The balance of insertion (INS) vs. deletion
(DEL) calls in the 17-sample cohort in T2T-CHM13 and GRCh38. Variants in T2T-CHM13
are stratified by whether or not they intersect regions which are nonsyntenic with
GRCh38. (E) The SV calls in T2T-CHM13 for two trios: a trio of Ashkenazi ancestry
[child HG002, and parents HG003 (46XY) and HG004 (46XX)], and a trio of Han
Chinese ancestry [child HG005, and parents HG006 (46XY) and HG007 (46XX)]. The
red arrows indicate child-only, or candidate de novo, variants (DEL, Deletion; DUP,
Duplication; INS, Insertion; INV, Inversion; TRA, Translocation). (F) The density of SVs
called from HiFi data in the 17-sample cohort across T2T-CHM13. (G) Alignments of HiFi
reads in the HG002 trio to T2T-CHM13 showing a deletion spanning an exon of the
transcript AC134980.2 viewed using the Integrative Genomics Viewer (IGV). Pink
horizontal rectangles indicate reads aligned to the forward strand; blue horizontal
rectangles indicate reads aligned to the reverse strand. Thin black lines indicate split-
read alignments. Small vertical rectangles indicate SNVs. (H) Alignments of HiFi reads
in the HG002 trio to the same region of GRCh38 as shown in (G), showing much poorer
mapping to GRCh38 than to T2T-CHM13, viewed using IGV with the same colors as in (G).

We also observed similar improvements in
the insertion/deletion balance for large SVs
(>500 bp) detected by Bionano optical map-
ping data in HG002 against the T2T-CHM13
reference, with an increase in deletions (1199 
versus 1379) and a decrease in insertions (2771 
versus 1431) with GRCh38 and T2T-CHM13, 
respectively (fig. S37). Using the T2T-CHM13 
reference for Bionano optical mapping also 
improved SV calling around gaps in GRCh38 
that are closed in T2T-CHM13 (fig. S38), which 
suggests that T2T-CHM13 offers improved indel 
balance relative to GRCh38 across multiple SV- 
calling methods. 
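
The shift in indel balance for the Bionano calls can be expressed as an insertion-to-deletion ratio. The Python sketch below uses the counts quoted in the text; the ratio itself is a summary chosen here for illustration and is not a statistic reported by the authors.

```python
# Large-SV counts for HG002 from Bionano optical mapping, as quoted in the text.
BIONANO_HG002 = {
    "GRCh38": {"INS": 2771, "DEL": 1199},
    "T2T-CHM13": {"INS": 1431, "DEL": 1379},
}


def ins_del_ratio(counts):
    """Insertions per deletion; a value near 1 indicates a balanced call set."""
    return counts["INS"] / counts["DEL"]


ratios = {reference: ins_del_ratio(counts) for reference, counts in BIONANO_HG002.items()}
```

Under this summary the ratio falls from roughly 2.3 insertions per deletion with GRCh38 to about 1.04 with T2T-CHM13.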


T2T-CHM13 facilitates the discovery of
de novo SVs 


To investigate the impacts of the T2T-CHM13 
reference on our ability to accurately detect de 
novo variants, we called SVs in both of our trio 
datasets using a combination of HiFi and 
ONT data and identified SVs only present in 
the child of the trio and supported by both 
technologies—approximately 40 variants per 
trio (Fig. 3E). Manual inspection revealed a 
few variants in each trio that were strongly 
supported with consistent coverage and align- 
ment breakpoints, whereas the other candi- 
dates exhibited less reliable alignments, as 
noted in previous reports (54). In HG002, we 
detected six strongly supported candidate de 
novo SVs that had been previously reported 
(29, 54). In HG005, we detected a 1571-bp
deletion at chr17:49401990 in T2T-CHM13 that
was supported as a candidate de novo SV rela-
tive to both T2T-CHM13 and GRCh38 (fig. S39).
This demonstrates the ability of T2T-CHM13 
to be used as a reference genome for de novo 
SV analysis. 
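
The selection of candidate de novo SVs described above (present only in the child and supported by both technologies) reduces to set operations once calls are keyed consistently. The Python sketch below is a simplified illustration: the SV keys are toy values, exact key matching stands in for the breakpoint-tolerant merging that a real SV comparison requires, and the function name is hypothetical.

```python
def candidate_de_novo(child, mother, father):
    """Child SVs supported by both HiFi and ONT and absent from both parents.

    Each argument maps a technology name ("HiFi", "ONT") to a set of SV keys.
    """
    supported = child["HiFi"] & child["ONT"]
    parental = set().union(*mother.values(), *father.values())
    return supported - parental


# Toy call sets keyed by (chromosome, position, type); values are illustrative only.
toy_child = {
    "HiFi": {("chr1", 1000, "DEL"), ("chr2", 500, "INS"), ("chr3", 42, "DEL")},
    "ONT": {("chr1", 1000, "DEL"), ("chr2", 500, "INS")},
}
toy_mother = {"HiFi": {("chr2", 500, "INS")}, "ONT": set()}
toy_father = {"HiFi": set(), "ONT": set()}
```

In this toy case only the chr1 deletion survives: the chr3 call lacks ONT support and the chr2 insertion is inherited. Candidates passing such a filter would still need the manual inspection described in the text.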


T2T-CHM13 enables the discovery of additional
SVs within previously unresolved sequences 


The improved accuracy and completeness of 
the T2T-CHM13 genome make it easier to 
resolve complex genomic regions. Within non- 
syntenic regions, we identified a total of 27,055 
SVs (Fig. 3D), the majority of which were 
deletions (15,998) and insertions (10,912). Of 
these SVs, 22,362 (82.7%; 8903 insertions, 
13,334 deletions) overlap previously unresolved 
sequences in T2T-CHM13, whereas the remain- 
ing SVs are now accessible because of the 
accuracy of the T2T-CHM13 reference. The AF 
and size distributions for these variants mirror 
the characteristics of the syntenic regions, with 
rare variants (fig. S40) and smaller (30 to 50 bp) 
indels (fig. S41) being the most abundant. 
However, we also note some nonsyntenic re- 
gions with few or zero SVs identified. Many 
of these regions lie at the interiors of p-arms 
of acrocentric centromeres, which are gaps in 
T2T-CHM13v1.0 that have been resolved in 
later versions of the assembly; however, we 
also noticed depletions of SVs in a few other 
highly repetitive regions, such as the resolved 
human satellite array on chromosome 9 (Fig.
3F). We largely attribute the reduction in variant
density to the low mappability of these complex 
and repetitive regions. Future improvements 
in read lengths and alignment algorithms are 
needed to further resolve such loci. 

Within syntenic regions, we also noted im- 
provements to alignment and variant calling 
accuracy, including the identification of variant 
calls not previously observed within homolo- 
gous regions of GRCh38. For example, in T2T- 
CHM13, we observed a deletion in all of the 
samples of the HG002 trio in an exon of the
olfactory receptor gene AC134980.2 (Fig. 3G),
whereas the reads from those samples largely 
failed to align to the corresponding region 
of GRCh38 (Fig. 3H). Meanwhile, reads from 
African samples (fig. S42) aligned to both re- 
ferences at this locus. The difference in align- 
ment among different samples is likely because 
the region is highly polymorphic for copy num- 
ber variation; GRCh38 contains a reasonable 
representation of that region for the tested 
African samples, whereas the homologous re- 
gion in T2T-CHM13 more closely resembles 
European samples (fig. S43). This highlights
the need for T2T reference genomes for as many 
diverse individuals as possible to account for 
common haplotype diversity. 


Variation within previously unresolved regions 
of the genome 

T2T-CHM13 enables variant calling in previously
unresolved and corrected regions of the genome 


The T2T-CHM13 genome contains 229 Mbp of 
sequence that is nonsyntenic to GRCh38, which 
intersects 207 protein-coding genes (Fig. 4A and 
Table 1). Within these regions, we report 
3,692,439 PASS variants across all 1KGP sam- 
ples from short reads (Fig. 4B and Table 1). 
Comparing variants called in a subset of 14 
HPRC+ samples with Illumina, HiFi, and ONT 
data, we found that 73 to 78% of the Illumina- 
discovered SNVs are concordant with variants 
identified with PacBio HiFi long-read data 
using the PEPPER-Margin-DeepVariant algo- 
rithm (51,306 to 74,122 matching SNVs and 
genotypes per sample) (55). Long reads discovered 
more than 10 times as many SNVs per sample 
as short reads in these regions, with 447,742 to 
615,085 (41 to 43%) SNVs matching between 
HiFi and ONT with PEPPER-Margin-Deep- 
Variant. In nonsyntenic regions, 97% of the 
SNVs called by HiFi fell in centromeric regions 
of CHM13, so we stratified concordance by type 
of satellite repeat within the centromere. We 
found that nonsatellites in centric transition 
regions and monomeric satellites had higher 
concordance between HiFi and ONT, with >99% 
concordance in a few regions, but some as low 
as 50%. Human satellite (HSAT) regions, which 
pose some of the greatest challenges for read 
mapping and harbor abundant structural varia- 
tion, exhibit the lowest rates of concordance 
between the platforms.

We further defined conservative high-
confidence regions by excluding regions with 
abnormal coverage in any long-read sample 
(i.e., coverage outside of 1.5x the interquar- 
tile range). This effectively excludes difficult- 
to-map regions with excessively repetitive 
alignments as well as copy number-variable 
regions. After excluding abnormal coverage 
from nonsyntenic regions, 14 Mbp remained, 
and SNVs from HiFi and ONT long reads were 
91 to 95% concordant (21,835 to 28,237 var- 
iants). We found 95 to 96% (14,575 to 18,949) 
of short-read SNVs in HiFi long-read calls, 
although 37 to 40% of HiFi SNVs were still 
missing from the short-read calls as a result of 
poorer mappability of the short reads (table 
S8). Although many nonsyntenic regions will 
require further method development [e.g., 
pangenome references (56)] to achieve accu- 
rate variant calls, the concordance of long- and 
short-read calls for tens of thousands of var- 
iants highlights previously unresolved sequen- 
ces that are immediately accessible to both 
technologies. 
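
The two steps in this paragraph, excluding regions with abnormal coverage and then measuring concordance between call sets, can be sketched as follows. This is an illustration under stated assumptions rather than the authors' procedure: "outside 1.5x the interquartile range" is interpreted here as the conventional Tukey fences (Q1 - 1.5 IQR and Q3 + 1.5 IQR), coverage is assumed to be summarized per bin, call sets are assumed to be dictionaries keyed by site, and all function names are hypothetical.

```python
from statistics import quantiles


def iqr_fences(values, k=1.5):
    """Return (low, high) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    spread = q3 - q1
    return q1 - k * spread, q3 + k * spread


def high_confidence_bins(coverage_by_sample, k=1.5):
    """Indices of bins whose coverage lies inside the fences in every sample."""
    keep = set(range(len(coverage_by_sample[0])))
    for coverage in coverage_by_sample:
        low, high = iqr_fences(coverage, k)
        keep &= {i for i, depth in enumerate(coverage) if low <= depth <= high}
    return sorted(keep)


def concordance(query_calls, other_calls):
    """Fraction of query calls whose site and genotype both appear in other_calls."""
    if not query_calls:
        return 0.0
    matched = sum(
        1 for site, genotype in query_calls.items() if other_calls.get(site) == genotype
    )
    return matched / len(query_calls)
```

Note that `concordance` is asymmetric, which is why a high fraction of short-read SNVs can be found in the HiFi calls while a substantial fraction of HiFi SNVs remains absent from the short-read calls.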

Because these broadly defined nonsyntenic 
regions include inversions and other struc- 
tural changes between GRCh38 and T2T- 
CHM13 that do not necessarily alter many of 
the variants contained within, we also con- 
sidered a narrower class of “previously un- 
resolved” sequences, representing segments 
of the T2T-CHM13 genome that do not align 
to GRCh38 with Winnowmap (49). Within 
these previously unresolved sequences, which 
span a total of 189 Mbp (Fig. 4A, Table 1, and 
fig. S44), we report a total of 2,370,384 PASS 
variants in 1KGP samples based on short reads,
intersecting 207 protein-coding genes (Fig. 4B, 
Table 1, and fig. S45). We note that this set of 
207 genes is distinct from the 207 genes that 
intersected with the nonsyntenic regions, and 
these two sets together comprise 329 unique 
genes. Because these previously unresolved 
sequences are enriched for highly repetitive 
sequences, concordance is slightly lower, such 
that 64 to 69% of the SNVs in each sample 
match variants found in PacBio HiFi long-read 
data from the same samples (24,371 to 36,501 
matching SNVs and genotypes per sample), 
and 339,783 to 473,074 (38 to 40%) of SNVs 
match between HiFi and ONT. When removing 
difficult-to-map and copy number-variable re- 
gions as above, 3 Mbp of high-confidence re- 
gions remained. Within high-confidence regions, 
84 to 88% of short-read SNVs in each sample 
matched variants found in each sample’s PacBio 
HiFi long-read data (2938 to 3811 matching 
SNVs and genotypes per sample), and 5544 to
8298 (81 to 90%) of SNVs matched between 
HiFi and ONT (table S8). Although these prev- 
iously unresolved regions are more challeng- 
ing than nonsyntenic regions, thousands of 
variants can still be called concordantly with 
short and long reads.

Fig. 4. Characterization of variants within regions of the genome resolved
by T2T-CHM13. (A) Number of bases added in nonsyntenic and previously
unresolved regions by chromosome, along with how many variants for each
respective region are mappable (have contiguous unique 100-mers). (B) Number
of variants in nonsyntenic and previously unresolved regions by chromosome.
(C) Distance from each previously unresolved-only, nonsyntenic-only, or
overlapping region to the closest ClinVar or GWAS Catalog variant. Insets are
zoomed to 1 Mbp. (D) Scan for variants in nonsyntenic (light blue and red) and
previously unresolved (dark blue and red) regions that exhibit extreme patterns
of allele frequency differentiation. Allele frequency outliers were identified for
each of eight ancestry components, colored by the superpopulation membership
of the corresponding 1KGP samples. Large values of the likelihood ratio statistic
(LRS) denote variants for which AF differences in the corresponding ancestry
component exceed those of a null model based on genome-wide covariances in
allele frequencies. (E and F) Population-specific allele frequencies of two highly
differentiated variants in previously unresolved regions.

We noted homology between GRCh38 col-
lapsed duplications and many T2T-CHM13
nonsyntenic and/or previously unresolved re-
gions (137 regions comprising 6.8 Mbp), in-
dicating that the T2T-CHM13 assembly corrects
these sequences through the deconvolution
of nearly identical repeats. Comparing total
variants identified in the 1KGP dataset, we
observed a significant decrease in variant den-
sities of 41 protein-coding genes intersect-
ing with GRCh38 collapsed duplications in
T2T-CHM13 (mean, 27 variants per kbp) com-
pared with GRCh38 (mean, 46 variants per
kbp; P = 6.906 x 10°, Wilcoxon signed-rank
test) (fig. S46). Besides differences in local
ancestries between the references, these higher
variant densities in GRCh38 in part represent
PSVs or misassigned alleles from missing
paralogs (57). Conversely, 1KGP variants were
significantly increased in 32 protein-coding
genes contained within GRCh38 false du-
plications using the T2T-CHM13 reference ge-
nome (mean values of 48 variants per kbp in
T2T-CHM13 versus 12 variants per kbp in
GRCh38; P = 4.657 x 107°, Wilcoxon signed-
rank test).

To assess whether these corrected complex
regions in T2T-CHM13 accurately reveal variation,
we evaluated the concordance of variants 
generated from short-read Illumina and Pac-
Bio HiFi sequencing datasets of two trios from
the GIAB consortium and the Personal Genome 
Project (58) and observed similar recall for 
Illumina data in T2T-CHM13 (20.1 to 28.3%) 
and GRCh38 (21.5 to 25.4%), but with im- 
proved precision in the variants identified 
(98.1 to 99.7% in T2T-CHM13 versus 64.3 to 
67.3% in GRCh38) in a subset of the GRCh38 
collapsed duplications (copy number < 10; 
~910 kbp) (table S9). Corrected false duplica- 
tions (1.2 Mbp) exhibited improved recall for 
Illumina data by a factor of 50 relative to HiFi
in T2T-CHM13 (57.4 to 68.3%) versus GRCh38
(1.1 to 1.8%), as well as improved precision in
T2T-CHM13 (98.5 to 99.3%) versus GRCh38
(76.5 to 95.8%) (table S9). These improve-
ments show that variants can be discovered
and genotyped in regions corrected by the
T2T-CHM13 assembly.

Table 1. Overview of nonsyntenic and previously unresolved regions and their respective variant counts.

SNVs concordant between long reads

SNVs concordant between long reads in high-confidence regions


Phenotypic associations and evolutionary 
signatures within nonsyntenic 
T2T-CHMI13 regions 


Sequences in the T2T-CHM13 assembly that 
are nonsyntenic with GRCh38 offer opportu- 
nities for future genetic studies. Several such 
loci lie in close proximity to variation that has 
been implicated in complex phenotypes or 
disease, supporting their potential biomedical 
importance. These include eight loci occur- 
ring within 10 kbp of GWAS hits and 19 loci 
within 10 kbp of ClinVar pathogenic variants 
(Fig. 4C). In addition, 113 of 22,474 GWAS hits 
(representing 0.5% of all variants in the studies 
we tested) segregated in LD (R² = 0.5) with
variants in nonsyntenic regions, thereby ex- 
panding the catalog of potential causal var- 
iants for these GWAS phenotypes (43) (fig. S47 
and table S10). 
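
Linkage disequilibrium between a GWAS hit and a variant in a nonsyntenic region is summarized above by R². As a reminder of what this statistic measures, the Python sketch below computes r² for two biallelic sites from phased haplotypes coded 0/1. The use of phased haplotypes and the function name are assumptions made for illustration; the study's own LD computation may differ (for example, by working from unphased genotypes).

```python
def r_squared(hap_a, hap_b):
    """Squared correlation (r^2) between two biallelic sites.

    hap_a and hap_b list the allele (0 or 1) carried at each site by the
    same haplotypes, in the same order.
    """
    n = len(hap_a)
    p_a = sum(hap_a) / n
    p_b = sum(hap_b) / n
    p_ab = sum(a & b for a, b in zip(hap_a, hap_b)) / n
    d = p_ab - p_a * p_b                      # disequilibrium coefficient D
    denominator = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denominator == 0:
        raise ValueError("r^2 is undefined when either site is monomorphic")
    return d * d / denominator
```

Two sites carried by exactly the same haplotypes give r² = 1, and partially overlapping carriers give intermediate values that can be compared against a chosen threshold.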

Using short read-based genotypes generated 
from the 1KGP cohort, we also searched for var- 
iants within nonsyntenic regions that exhibit 
large differences in AF between populations— 
a signature that can reflect historical positive 
selection or demographic forces shaping these 
previously inaccessible regions of the genome. 
To study these signatures, we applied Ohana 
(59), a method that models individuals as pos-
sessing ancestry from k components and tests
for ancestry component-specific frequency out- 
liers. Focusing on continental-scale patterns 
(k = 8; fig. S48), we identified 5154 unique
SNVs and indels across all ancestry compo-
nents that exhibited strong deviation from
genome-wide patterns of AF (99.9th percentile
of distribution for each ancestry component;
Fig. 4D). These included 814 variants over-
lapping with annotated genes and 195 variants
that intersected annotated exons.

Table 1, “Nonsyntenic” column: 240,044,315 (228,569,315); 3,692,439 (138,829); 41 to 43%; 13,683,528; 60 to 63%; 91 to 95%

We first focused on the 3038 highly differ- 
entiated nonsyntenic variants that lift over 
from T2T-CHM13 to GRCh38. These successful 
liftovers allowed us to make direct comparisons 
to selection results, generated with identical 
methods, using 1KGP phase 3 data aligned to
GRCh38 (fig. S49) (34, 60). For 41.3% of the 
lifted-over variants, we found GRCh38 var- 
iants within a 2-kbp window that possessed 
similar or higher likelihood ratio statistics for 
the same ancestry component, indicating that 
these loci were possible to identify in scans of 
GRCh38 (fig. S50). The remaining 58.7% of 
lifted-over variants may represent regions 
of the genome where differences in the T2T- 
CHM13 and 1KGP phase 3 variant calling or 
filtering procedures lead to discrepancies in 
AFs between these two datasets. They may 
also indicate regions whose more accurate 
representation in T2T-CHM13 improves var- 
iant calling enough to resolve previously un- 
known signatures of AF differentiation (fig. 
S51). We then investigated the 943 variants that 
could not be lifted over from T2T-CHM13 to 
GRCh38 and were located in both previously 
unresolved sequences and regions deemed map- 
pable from unique 100-mer analysis. Some of 
these variants overlap with genes, including 
several annotated with RNA transcripts in 
regions not present in the GRCh38 assembly 
(Fig. 4D and table S11). 

Previously unresolved: 189,036,735 (177,561,735); 2,370,384 (52,567); 38 to 40%; [illegible]; 39 to 46%; 81 to 90% 

We highlight two loci that exhibit some of 
the strongest allele frequency differentiation 
observed across ancestry components. The first 
locus, located in a centromeric alpha satellite 
on chromosome 16, contains variants that reach 
intermediate allele frequency in the ancestry 
component corresponding to the Peruvian in 
Lima, Peru (PEL) and other admixed American 
populations of 1KGP [AFs, 0.49 in PEL; 0.20 
in CLM (Colombian in Medellin, Colombia) 
and MXL (Mexican ancestry in Los Angeles, 
California); absent or nearly absent elsewhere; 
Fig. 4E and figs. S52 and S53]. Variants at the 
second locus, located in a previously unresolved 
T2T-CHM13 sequence on the X chromosome 
that contains a multi-kbp imperfect AT tandem 
repeat, exhibit high AFs in the ancestry com- 
ponent corresponding to African populations 
of 1KGP and low AFs in other populations 
(AFs, 0.67 in African populations and 0.014 in 
European populations; Fig. 4F and figs. S54 
and S55). The variant at this locus with the 
strongest signature of frequency differentia- 
tion also lies within 10 kbp of two pseudo- 
genes, MOB1AP2-201 (MOB kinase activator 
1A pseudogene 2) and BX842568.1-201 (ferri- 
tin, heavy polypeptide-like 17 pseudogene). 

We note that as a consequence of the 
repetitive nature of the sequences in which 
they reside, many of the loci that we highlight 
here remain challenging to genotype with short 
reads, and individual variant calls remain 
uncertain. Nonetheless, patterns of AF dif- 
ferentiation across populations are relatively 
robust to such challenges and can still serve 
as proxies for more complex SVs whose se- 
quences cannot be resolved by short reads 
alone. The presence of population-specific 
signatures at these loci highlights the poten- 
tial for T2T-CHM13 to reveal evolutionary 
signals in previously unresolved regions of 
the genome. 
Fig. 5. T2T-CHM13 improves clinical genomics variant calling. (A) Numbers of 
potential loss-of-function mutations in the T2T-CHM13 reference. (B) The counts of 
medically relevant genes affected by genomic features and variation in GRCh38 
(blue) and T2T-CHM13 (orange) are depicted as bar plots on logarithmic scale. Light 
blue indicates genes affected in GRCh38 where homologous genes were not identified 
in T2T-CHM13 due to inability to lift over, with counts included in parentheses. 
(C) An example erroneous GRCh38 complex SV corrected in T2T-CHM13 affecting 
TNNT3 and LINC01150, displayed by sequence comparison using miropeats (88) with 
homologous regions colored in green and blue, respectively. HG002 PacBio HiFi 
data are displayed showing read coverages and mappings from IGV, with allele 
fractions of variant sites colored (red, T; green, A; blue, C; black, G) within 
histograms of read depth (0 to 50). (D and E) Snapshots of regions using IGV and 
UCSC Genome Browser representing a collapsed duplication in GRCh38 corrected 
in T2T-CHM13 affecting KCNJ18 (D) and a false duplication in GRCh38 affecting 
most of KCNE1 (E). SDs depicted on top are colored by sequence similarity to 
paralog (gray, 90 to 98%; orange, >99%). Read mappings and variants from 
HG002 Illumina, PacBio HiFi, and ONT (mappings only), with homozygous (light 
blue) and heterozygous (dark blue) variants depicted as dashes. Colors within 
histograms of read depth (0-120) are the same as described in (C). Copy number 
estimates, displayed as colors indicated in legends, across k-merized versions 
of the GRCh38 and T2T-CHM13 references as well as representative examples of the 
SGDP individuals. (F) An example CDS region of KCNJ18 (highlighted as a red box 
in D), with amino acids colored in alternating shades of blue and potential start codons 
(methionines) labeled in green using the UCSC Genome Browser codon-coloring 
scheme. Alignments of KCNJ18 (blue), KCNJ12 (orange), and KCNJ17 (pink) along 
with allele counts of variants in each gene identified on GRCh38 and T2T-CHM13 
are shown as bar plots (to approximate scale per variant), with examples 1 to 
7 described in table S14. (G) Schematic depicts a benchmark for 269 challenging 
medically relevant genes for HG002. The number of variant-calling errors from 
three sequencing technologies on each reference is plotted. 
Impact of T2T-CHM13 on clinical genomics 

Variants of potential clinical relevance 
in T2T-CHM13 

A deleterious variant in a reference genome 
can mislead the interpretation of a clinical 
variant identified in a patient because it may 
not be flagged as such using standard analysis 
tools. The GRCh38 reference genome is known 
to contain such variants that likely affect gene 
expression, protein structure, or protein function 
(25), although systematic efforts have sought 
to identify and remove these alleles (3). To 
determine the existence and location of loss- 
of-function variants in T2T-CHM13, we aligned 
the assembly to GRCh38 using dipcall (61) to 
identify and functionally annotate nucleotide 
differences (62) (Fig. 5A). This analysis identi- 
fied 210 putative loss-of-function variants (de- 
fined as variants that affect protein-coding 
regions and predicted splice sites) affecting 
189 genes, 31 of which are clinically relevant 
(23). These results are in line with work show- 
ing that the average diploid human genome 
contains ~450 putative loss-of-function variants 
affecting ~200 to 300 genes when low-coverage 
Illumina sequencing is applied (before stringent 
filtering) (63). 

Of these 210 variants, 158 have been iden- 
tified in at least one individual from the 1KGP, 
with most variants relatively common in human 
populations (median AF of 0.47), suggesting that 
they are functionally tolerated. The remaining 
variants not found in 1KGP individuals com- 
prise larger indels, which are more difficult 
to identify with 1KGP Illumina data, as well as 
alleles that are rare or unique to CHM13. We 
curated the 10 variants affecting medically 
relevant genes and found seven that likely 
derived from duplicate paralogs, along with a 100-bp in- 
sertion also found in long reads of HG002, a 
stop gain in a final exon in one gnomAD sam- 
ple, and an insertion in a homopolymer in a 
variable-number tandem repeat in CEL, which 
may be an error in the assembly. Understand- 
ing that the T2T-CHM13 assembly represents 
a human genome harboring potentially func- 
tional or rare variants that in turn would affect 
the ability to call variants at those sites, we 
have made available the full list of putative 
loss-of-function variants to aid in the interpre- 
tation of sequencing results (table S12). 


T2T-CHM13 improves variant calling for 
medically relevant genes 


We sought to understand how the transition 
from GRCh38 to the T2T-CHM13 reference 
might have an impact on variants identified in 
a previously compiled (23) set of 4964 medi- 
cally relevant genes residing on human auto- 
somes and chromosome X (representing 4924 
genes in T2T-CHM13 via liftover; table S13). 
Of these genes, 28 mapped to previously un- 
resolved and/or nonsyntenic regions of T2T- 
CHM13. We found more than twice as many 
medically relevant genes affected by rare or 
erroneous structural alleles on GRCh38 (n = 
756 including 14 with no T2T-CHM13 lift- 
over) compared to T2T-CHM13 (n = 306) (Fig. 
5B), of which 622 genes appear corrected in 
T2T-CHM13. This includes 116 genes falling 
in regions previously flagged as erroneous in 
GRCh38 by the GRC. The majority (82%) of 
affected clinically relevant genes in GRCh38 
overlap SVs that exist in all 13 HiFi-sequenced 
individuals, likely representing rare alleles or 
errors in the reference (see above), including 
13 of the 14 genes with no T2T-CHM13 liftover. 

One example of a resolved gene structure 
involves TNNT3, which encodes Troponin T3, 
fast skeletal type, and is implicated in forms 
of arthrogryposis (64). When calling SVs with 
respect to GRCh38, TNNT3 was previously 
postulated to be affected by a complex struc- 
tural rearrangement in all individuals, consist- 
ing of a 24-kbp inversion and 22-kbp upstream 
deletion, which also ablates LINC01150 (Fig. 
5C). The GRC determined that a problem ex- 
isted with the GRCh38 reference in this region 
(GRC issue HG-28). Analysis of this region in 
T2T-CHM13 instead shows a complex rear- 
rangement with the 22-kbp region upstream of 
TNNT3 inversely transposed in the T2T-CHM13 
assembly to the proximal side of the gene. Be- 
sides potentially affecting interpretations of 
gene regulation, this structural correction of 
the reference places TNNT3 >20 kbp closer to 
its genetically linked partner TNNI2 (65). Other 
genes have variable number tandem repeats 
(VNTRs) that are collapsed in GRCh38, such 
as one expanded by 17 kbp in most individ- 
uals in the medically relevant gene GPI. MUC3A 
was also flagged with a whole-gene ampli- 
fication in all individuals, which we identified 
as residing within a falsely collapsed SD 
in GRCh38, further evidencing that find- 
ing (Fig. 1A). 

Seventeen medically relevant genes reside 
within erroneous duplicated and putative col- 
lapsed regions in GRCh38 (tables S1 and S3), 
including KCNE1 (false duplication) and KCNJ18 
(collapsed duplication) (Fig. 5, D and E). For 
these genes, we show that a significant skew in 
total variant density occurs in GRCh38 (58 
variants per kbp for eight genes in collapsed 
duplications and 21 variants per kbp for seven 
genes in false duplications; P = 5.684 x 10° 
and 6.195 x 10°+, respectively, Mann-Whitney 
U test) versus the rest of the 4909 medically 
relevant gene set (40 variants per kbp) that 
largely disappears in T2T-CHM13 (40 variants 
per kbp in collapsed duplications and 47 va- 
riants per kbp in false duplications versus 
41 variants per kbp for the remaining gene set; 
P = 0.8778 and 0.0219, respectively) (fig. S45). 
Examining KCNE1, we found that coverage is 
much lower than normal on GRCh38 for short 
and long reads and that most variants are 
missed because many reads incorrectly map 
to a likely false duplication (KCNE1B on the 
p-arm of chromosome 21). The k-mer-based 
copy number of this region in all 266 SGDP 
genomes supports the T2T-CHM13 copy num- 
ber as well as its lack of duplication in GRCh37 
(23). As for KCNJ18, which resides within a 
GRCh38 collapsed duplication at chromosome 
17p11.2 (66), we found increased coverage and 
variants within HG002 using short- and long-read 
sequences in GRCh38 relative to T2T-CHM13. 
To verify whether the additional variants 
identified using GRCh38 are false heterozygous 
calls from PSVs derived from missing dupli- 
cate paralogs, we compared the distributions 
of MAFs across the 49-kbp SD. We observed a 
shift in SNV proportions, with a relative de- 
crease in intermediate-frequency alleles and 
a relative increase in rare alleles for KCNJ18 
and KCNJ12 (another collapsed duplication re- 
siding distally at chromosome 17p11.2) in T2T- 
CHM13 compared with GRCh38 (P = 8.885 x 
10°? and 3.102 x 10~, respectively; Mann-Whitney 
U test) (fig. S56). We matched the homologous 
positions of discovered alternative alleles in 
GRCh38 and T2T-CHM13 across the three 
paralogs—including the previously missing 
paralog located in a centromere-associated 
region on chromosome 17p KCNJ17, denoted 
KCNJ18-1 in T2T-CHM13—and observed that 
even true variants (i.e., non-PSVs) had discor- 
dant allele counts in KCNJ18 and KCNJ12 be- 
tween the two references (Fig. 5F and table 
S14). Considering that rare variants of KCNJ18 
contribute to muscle channelopathy-thyrotoxic 
periodic paralysis (66), including nine “patho- 
genic” or “likely pathogenic” variants in ClinVar, 
increased sensitivity to discover variants in 
patients using T2T-CHM13 would have a sub- 
stantial clinical impact. In summary, the im- 
proved representation of this gene and other 
collapsed duplications in T2T-CHM13 not only 
eliminates false positives but also improves de- 
tection and genotyping of true variants. 


Clinical gene benchmark demonstrates that 
T2T-CHM13 reduces errors across technologies 


Finally, to determine how the T2T-CHM13 
genome improved the ability to assay variation 
broadly, we used a curated diploid assembly 
to develop a benchmark for 269 challenging 
medically relevant genes in GIAB Ashkenazi 
son HG002 (23), with comparable benchmark 
regions on GRCh38 and T2T-CHM13. We tested 
three short- and long-read variant call sets 
against this benchmark: Illumina-BWAMEM- 
GATK, HiFi-PEPPER-DeepVariant, and ONT- 
PEPPER-DeepVariant. Counts of both false 
positives and false negatives substantially de- 
creased for all three call sets when using T2T- 
CHM13 as a reference instead of GRCh38 (Fig. 
5G and table S15). The number of false posi- 
tives for HiFi decreased by a factor of 12 in 
these genes, primarily because of the addition 
of missing sequences similar to KMT2C (fig. S15) 
and removal of false duplications of CBS, CRYAA, 
H19, and KCNE1 (Fig. 5G). As demonstrated 
above, T2T-CHM13 better represents these 
genes and others for a diverse set of individ- 
uals, so performance should be higher across 
diverse ancestries. Furthermore, the number 
of true positives decreased by a much smaller 
fraction than the errors (~14%); this is due to a 
reduction of true homozygous variants caused 
by T2T-CHM13 possessing fewer ultra-rare and 
private alleles (Fig. 2G). This benchmarking 
demonstrates concrete performance gains in 
specific medically relevant genes resulting from 
the highly accurate assembly of a single genome. 


Discussion 


Difficult regions of the human reference 
genome, ranging from collapsed duplications 
to missing sequences, have remained un- 
resolved for decades. The assumptions that 
most genomic analyses make about the cor- 
rectness of the reference genome have contributed 
to spurious clinical findings and mistaken disease 
associations (67-70). Here, we identify variation 
in difficult-to-resolve regions and show that 
the T2T-CHM13 reference genome universally 
improves genomic analyses for all populations 
by correcting major structural defects and ad- 
ding sequences that were absent from GRCh38. 
In particular, we show that the T2T-CHM13 
assembly (i) revealed millions of additional 
variants and the existence of additional copies 
of medically relevant genes (e.g., KCNJ17) 
within the 240 Mbp and 189 Mbp of non- 
syntenic and previously unresolved sequence, 
respectively; (ii) eliminated tens of thousands 
of spurious variants and incorrect genotypes 
per sample, including within medically rele- 
vant genes (e.g., KCNJ18) by expanding 203 loci 
(8.04 Mbp) that were collapsed in GRCh38; (iii) 
improved genotyping by eliminating 12 loci 
(1.2 Mbp) that were duplicated in GRCh38; 
and (iv) yielded more comprehensive SV calling 
genome-wide, with an improved insertion/ 
deletion balance, by correcting collapsed tan- 
dem repeats. Overall, the T2T-CHM13 assembly 
reduced false positive and false negative SNVs 
from short and long reads by as much as a 
factor of 12 in challenging, medically relevant 
genes. The T2T-CHM13 reference also accu- 
rately represents the haplotype structure of 
human genomes, eliminating 1390 artificial 
recombinant haplotypes in GRCh38 that oc- 
curred as artifacts of BAC clone boundaries. 
These improvements will broadly enable future 
discoveries and refine analyses across all of 
human genetics and genomics. 

Given these advances, we advocate for a 
rapid transition to the T2T-CHM13 genome 
as a reference. Although we appreciate that 
transitioning institutional databases, pipelines, 
and clinical knowledge from GRCh38 to T2T- 
CHM13 will require substantial bioinformatics 
and clinical effort, we provide several resources 
to advance this goal. On a practical level, im- 
provements to large genomic regions, such as 
entire p-arms of the acrocentric chromosomes, 
and the discovery of clinically relevant genes 
and disease-causing variants justify the labor and 
cost required to incorporate T2T-CHM13 into 
basic science and clinical genomic studies. On a 
technical level, T2T-CHM13 simplifies genome 
analysis and interpretation because it consists 
of 23 complete linear sequences and is free of 
“patch,” unplaced, or unlocalized sequences. 
Many of the corrections introduced by T2T- 
CHM13 were previously noted and addressed 
by the GRC as “fix patches,” but few studies 
use these existing resources. The reduced con- 
tig set of T2T-CHM13 also facilitates interpre- 
tation and is directly compatible with the most 
commonly used analysis tools. To promote 
this transition, we provide variant calls and 
several other annotations for the T2T-CHM13 
genome within the UCSC Genome Browser and 
the NHGRI AnVIL as a resource for the human 
genomics and medical communities. 

Finally, our work underscores the need for 
additional T2T genomes. Most urgently, the 
CHM13 genome lacks a Y chromosome, so our 
analysis relied on the incomplete representation 
of chromosome Y from GRCh38. A T2T represent- 
ation of the Y chromosome should further im- 
prove mapping and variant analysis, especially 
with respect to variants on the Y chromosome 
itself. Furthermore, many of the previously un- 
resolved regions in T2T-CHM13 are present in 
all human genomes and enable variant calling 
with traditional methods from short and/or long 
reads. However, many previously unresolved 
regions identified in the T2T-CHM13 genome 
exhibit substantial variation within and between 
populations, including satellite DNA (37) and 
SDs that are polymorphic in copy number and 
structure (32). Relatedly, the T2T-CHM13 ref- 
erence provides a basis for calling millions of 
variants that were previously hidden, but many 
of these variants are challenging to resolve ac- 
curately with current sequencing technologies 
and analysis algorithms. Robust variant call- 
ing in these regions will require many hundreds 
or thousands of diverse haplotype-resolved T2T 
assemblies to construct a pangenome reference, 
such as the effort now underway by the Human 
Pangenome Reference Consortium (56). These 
assemblies will then motivate further develop- 
ment of methods for discovering, represent- 
ing, comparing, and interpreting complex 
variation, as well as benchmarks to evaluate 
their respective performances (71, 72). 

Through our detailed assessment of variant 
calling across global population samples, our 
study showcases T2T-CHM13 as a preeminent 
reference for human genetics. The annotation 
resources provided herein will help facilitate 
this transition, expanding knowledge of human 
genetic diversity by revealing hidden functional 
variation. 
Methods summary 

Haplotype structure 

We examined the impact of the fact that 
GRCh38 comprises a mosaic of clones derived 
from multiple donor individuals on its haplo- 
type structure. To this end, we searched for 
“LD-discordant” SNP pairs, defined as com- 
mon (>10% MAF) SNPs that segregate in per- 
fect LD (R² = 1) in the 1KGP sample, but for 
which GRCh38 possesses a pair of alleles that 
are never observed together on a single phased 
haplotype among 1KGP samples (i.e., alleles in 
perfect negative LD). We then compared these 
results to the same analysis applied to each 
1KGP sample using a leave-one-out strategy. 
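
The LD-discordance definition above can be sketched in a few lines. This is only an illustration under stated assumptions: phased haplotypes are encoded as lists of 0/1 alleles per site, and the function names, dictionary layout, and toy panel are hypothetical rather than taken from the study's pipeline.

```python
from itertools import combinations


def r_squared(hap_a, hap_b):
    """Squared allelic correlation between two biallelic sites (0/1 per haplotype)."""
    n = len(hap_a)
    p_a = sum(hap_a) / n
    p_b = sum(hap_b) / n
    p_ab = sum(1 for a, b in zip(hap_a, hap_b) if a == 1 and b == 1) / n
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return 0.0  # monomorphic site: LD is undefined, treat as no LD
    d = p_ab - p_a * p_b
    return d * d / denom


def maf(hap):
    """Minor allele frequency of one site."""
    p = sum(hap) / len(hap)
    return min(p, 1 - p)


def ld_discordant_pairs(panel, reference, min_maf=0.10):
    """Return site pairs in perfect LD whose reference allele pair is never observed.

    panel:     dict site -> list of 0/1 alleles, one entry per phased haplotype
    reference: dict site -> 0/1 allele carried by the reference assembly
    """
    discordant = []
    for s1, s2 in combinations(sorted(panel), 2):
        h1, h2 = panel[s1], panel[s2]
        if maf(h1) <= min_maf or maf(h2) <= min_maf:
            continue  # keep only common SNPs (>10% MAF)
        if abs(r_squared(h1, h2) - 1.0) > 1e-12:
            continue  # keep only pairs in perfect LD
        observed = set(zip(h1, h2))
        if (reference[s1], reference[s2]) not in observed:
            discordant.append((s1, s2))
    return discordant


# Toy panel of six haplotypes (hypothetical data).
panel = {
    "snpA": [0, 0, 1, 1, 0, 1],
    "snpB": [0, 0, 1, 1, 0, 1],
    "snpC": [1, 1, 0, 0, 1, 0],
    "snpD": [0, 1, 0, 1, 0, 1],
}
reference = {"snpA": 0, "snpB": 1, "snpC": 1, "snpD": 0}
print(ld_discordant_pairs(panel, reference))
# [('snpA', 'snpB'), ('snpB', 'snpC')]
```

Applying the same function with each panel haplotype in turn as the "reference" (and that haplotype removed from the panel) would mirror the leave-one-out comparison, although haplotypes of real samples would need diploid handling that this sketch omits.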


Duplication errors 


We flagged putatively collapsed duplications as 
regions >5 kbp containing clusters of heter- 
ozygous variants identified from two CHM13 
datasets [simulated Illumina-like reads from 
T2T-CHM13 reference v1.0 including the GRCh38 
Y chromosome and PacBio HiFi reads (73)] 
mapping against GRCh38 and T2T-CHM13 
references. False duplications were identified 
as regions, converted to T2T-CHM13 coordinates, 
with median read-depth copy numbers (32) lower 
in kmerized GRCh38 compared to kmerized 
T2T-CHM13 and 88% of SGDP individuals. 
Alternatively, false duplications were identified 
as regions >3 kbp with copy numbers greater in 
kmerized GRCh38 compared to kmerized T2T- 
CHM13 and 99% of SGDP individuals using a 
genome-wide sliding-window approach. 


Liftover of resources from GRCh38 
to T2T-CHM13 


Using the GATK release 4.1.9 (74) LiftoverVcf 
(Picard) tool, we lifted dbSNP build 154 (75), 
the March 8, 2021 release of Clinvar (76), and 
GWAS Catalog v1.0 (43) from the GRCh38 
assembly to the T2T-CHM13 assembly. Initial 
liftover was done with default LiftoverVcf 
parameters. A secondary round of liftover was 
performed to recover variants with swapped 
reference and alternative alleles between GRCh38 
and T2T-CHM13. We cataloged variants that 
failed to lift over because they overlap an indel 
that distinguishes T2T-CHM13 and GRCh38 
based on results from dipcall. 
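
The second liftover pass, which recovers variants whose reference and alternative alleles are swapped between assemblies, can be illustrated as follows. This is a simplified sketch for biallelic SNVs only; the record layout and the `recover_swapped` helper are hypothetical and do not reproduce the LiftoverVcf implementation.

```python
def recover_swapped(record, target_ref_base):
    """Re-express a biallelic SNV against a target reference base.

    record: dict with 'ref', 'alt' (single bases) and 'gt', a list of
            genotype tuples holding allele indices (0 = ref, 1 = alt).
    Returns the record unchanged if the target base matches REF, a
    flipped record if it matches ALT, or None if it matches neither.
    """
    if target_ref_base == record["ref"]:
        return dict(record, swapped=False)
    if target_ref_base != record["alt"]:
        return None  # neither allele matches the target: liftover fails
    # Swap REF and ALT, and flip every allele index so genotypes keep their meaning.
    flipped = [tuple(1 - allele for allele in gt) for gt in record["gt"]]
    return {
        "ref": record["alt"],
        "alt": record["ref"],
        "gt": flipped,
        "swapped": True,
    }


rec = {"ref": "A", "alt": "G", "gt": [(0, 1), (1, 1)]}
print(recover_swapped(rec, "G"))
# {'ref': 'G', 'alt': 'A', 'gt': [(1, 0), (0, 0)], 'swapped': True}
```

Flipping genotypes along with the alleles matters because allele frequencies are otherwise inverted: a variant with AF of 0.9 on one reference corresponds to AF of 0.1 once the alleles are swapped.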


Short-read variant calling 


To evaluate short-read small-variant calling 
between GRCh38 and T2T-CHM13, we used 
the NHGRI AnVIL (44) to align all 3,202 1KGP 
samples to CHM13 with BWA-MEM (45) and 
performed variant calling with GATK Haplo- 
typeCaller (77) using a workflow modeled on the 
one developed by the New York Genome 
Center (NYGC) for 1KGP analysis performed 
on GRCh38 (28). As in the NYGC analysis, we 
recalibrated the variant calls with GATK 
VariantRecalibrator. We analyzed coverage sta- 
tistics using samtools and AF using bedtools. To 
identify Mendelian-discordant variants, we used 
GATK VariantEval. 


Long-read variant calling 


To compare long-read mapping and large SV 
calling between T2T-CHM13 and GRCh38, 
we aligned HiFi and ONT data from 17 sam- 
ples of diverse ancestry to each reference with 
both Winnowmap (49) and minimap2 (50) and 
called SVs with Sniffles (53). Variant calls were 
refined with Iris, and HiFi-derived calls from 
both aligners were merged with Jasmine 
(54); we used the resulting sets of 124,566 SVs in 
GRCh38 and 141,193 SVs in T2T-CHM13 to 
compute AFs and other cohort-level statis- 
tics. In addition, we constructed trio-level 
callsets for two trios—the HG002 and HG005 
trios from the GIAB Consortium—to compare 
Mendelian discordance rates between the two 
references. 
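
As an illustration of what a Mendelian discordance rate measures in a trio, the sketch below checks whether each child genotype can be formed from one paternal and one maternal allele. It assumes diploid autosomal genotypes written as tuples of allele indices; the function names and the example sites are hypothetical, and SV genotypes in practice need matching between callsets before such a check applies.

```python
def is_mendelian_consistent(child, father, mother):
    """True if the child genotype can be built from one allele of each parent."""
    a, b = child
    return (a in father and b in mother) or (b in father and a in mother)


def discordance_rate(trio_genotypes):
    """Fraction of sites whose child genotype violates Mendelian inheritance.

    trio_genotypes: iterable of (child, father, mother) genotype tuples.
    """
    total = 0
    discordant = 0
    for child, father, mother in trio_genotypes:
        total += 1
        if not is_mendelian_consistent(child, father, mother):
            discordant += 1
    return discordant / total if total else 0.0


sites = [
    ((0, 1), (0, 0), (1, 1)),  # consistent: 0 from father, 1 from mother
    ((1, 1), (0, 0), (0, 1)),  # discordant: father carries no alt allele
    ((0, 0), (0, 1), (0, 1)),  # consistent
    ((0, 1), (1, 1), (1, 1)),  # discordant: neither parent carries a ref allele
]
print(discordance_rate(sites))  # 0.5
```

Computing this rate once per reference for the same trio gives a direct comparison: fewer violations suggest fewer genotyping errors, assuming de novo mutations are rare relative to call errors.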


Concordance of variants analysis across 
sequencing type 


To evaluate the variant calls in nonsyntenic 
regions, we derived concordance between var- 
iant calls generated with HiFi, ONT, and Illumina 
reads. For each sample, we used bcftools to filter 
the non-PASS variants, indels, and nonauto- 
somal variants from each callset. We then used 
hap.py (78) to derive the precision, recall, and F1- 
score between each variant callset to determine 
how many variants are common between each 
pair of sets. 
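
The pairwise concordance metrics can be sketched as below. Note the simplifying assumption: variants are matched by exact (chromosome, position, ref, alt) keys, whereas hap.py performs haplotype-aware comparison, so this is not a substitute for that tool. The function name and the example variants are hypothetical.

```python
def compare_callsets(truth, query):
    """Precision, recall, and F1 of `query` against `truth`.

    Both arguments are iterables of (chrom, pos, ref, alt) tuples.
    Matching is by exact key, which is a simplification.
    """
    truth, query = set(truth), set(query)
    tp = len(truth & query)   # calls present in both sets
    fp = len(query - truth)   # calls only in the query set
    fn = len(truth - query)   # calls only in the truth set
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall, "f1": f1}


hifi = [("chr1", 100, "A", "G"), ("chr1", 250, "C", "T"),
        ("chr2", 40, "G", "A"), ("chr2", 90, "T", "C")]
ont = [("chr1", 100, "A", "G"), ("chr1", 250, "C", "T"),
       ("chr2", 40, "G", "A"), ("chr3", 7, "A", "C")]
print(compare_callsets(hifi, ont))
```

Because neither callset is a gold standard here, treating one as "truth" is only a convention: swapping the arguments exchanges precision and recall and leaves F1 unchanged.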


AF differentiation of nonsyntenic variants 


Using short read-based variant calls within 
T2T-CHM13 nonsyntenic regions, we searched 
for variants with signatures of extreme AF dif- 
ferentiation across human populations. We 
performed this analysis with Ohana (59), a 
method that infers admixture components 
for each sample and quantifies frequency var- 
iation among the components. For outlier non- 
syntenic variants with extreme patterns of AF 
differentiation, we used liftover to compare our 
results to previous results generated with 1KGP 
phase 3 data aligned to GRCh38 (60). 
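
The results define outliers as variants in the 99.9th percentile of the distribution for each ancestry component. A minimal sketch of that thresholding step follows, assuming a per-variant test statistic has already been computed for each component. The nearest-rank percentile definition, the strict "greater than" cutoff, and all names are assumptions of this sketch, not details taken from Ohana.

```python
import math
from fractions import Fraction


def percentile_value(values, pct):
    """Nearest-rank percentile; exact rational arithmetic avoids float rounding."""
    ordered = sorted(values)
    rank = max(1, math.ceil(Fraction(str(pct)) * len(ordered) / 100))
    return ordered[rank - 1]


def outliers_by_component(stats, pct=99.9):
    """Flag variants whose statistic exceeds the per-component percentile cutoff.

    stats: dict component -> dict variant_id -> test statistic
    """
    flagged = {}
    for component, per_variant in stats.items():
        if not per_variant:
            flagged[component] = []
            continue
        cutoff = percentile_value(per_variant.values(), pct)
        flagged[component] = sorted(
            variant for variant, stat in per_variant.items() if stat > cutoff
        )
    return flagged


# Two hypothetical components with 1000 variants each.
stats = {
    "k1": {f"v{i}": float(i) for i in range(1, 1001)},
    "k2": {f"v{i}": float(1001 - i) for i in range(1, 1001)},
}
print(outliers_by_component(stats))
# {'k1': ['v1000'], 'k2': ['v1']}
```

Thresholding each component separately, rather than pooling statistics, keeps a component with generally weaker signals from being excluded entirely.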


T2T-CHM13 dipcall and Variant Effect 
Predictor (VEP) 


VEP (62) (version 102.0) was used to annotate 
variants generated by dipcall (61) when align- 
ing the T2T-CHM13 reference genome (chm13_v1.0_plus38Y.fa) 
to the GRCh38 reference genome (hg38.no_alt.fa). 
VCF files were annotated 
without the -filter_common and -canonical flags. 
CADD (79) v1.6 and raw SpliceAI (80) scores were 
added using both the CADD and SpliceAI plug- 
ins. Variants were filtered based on predicted 
HIGH functional impact. 


HG002 medically relevant genes benchmark 


To evaluate variant call accuracy when using 
T2T-CHM13 vs. GRCh38 as a reference, we 
developed equivalent small variant benchmarks 
for GIAB sample HG002 in 269 challenging, 
medically relevant genes. Methods were adapted 
from a companion manuscript that describes 
a curated benchmark for these genes created 
by using variants generated by dipcall (61) 
when aligning a trio-based hifiasm assembly 
to GRCh37 and GRCh38 (23). 

Aganezov et al., Science 376, eabl3533 (2022) 
INTRODUCTION: Large, high-identity duplicated 
sequences—termed segmental duplications 
(SDs)—are frequently the last regions of ge- 
nomes to be sequenced and assembled. While 
the human reference genome provided a road- 
map of the SD landscape, >50% of the remain- 
ing gaps correspond to regions of complex SDs. 


RATIONALE: SDs are major sources of evolu- 
tionary gene innovations and contribute dis- 
proportionately to genetic variation within and 
between ape species. With the complete human 
genome (T2T-CHM13), researchers have the po- 
tential to identify genes and uncover patterns of 
human genetic variation. 


More-complete segmental duplication content improves genotyping. (A) Increase (by a factor of 10) in the 
number of large (>10 kilo-base pairs) acrocentric segmental duplications (red) in T2T-CHM13 (right) compared 
with GRCh38 (left). (B) Copy number genotyping based on read-depth from 268 diverse human genomes 
across the globe shows that 90% of new SDs in T2T-CHM13 (red) are more likely to reflect human copy number 
when compared to GRCh38 (blue) irrespective of human population group considered. 


RESULTS: We identified 51 million base pairs 
(Mbp) of additional human SD in T2T-CHM13 
and now estimate that 7% of the human ge- 
nome consists of SDs [218 Mbp of 3.1 billion 
base pairs (Gbp)]. SDs make up two-thirds 
(45.1 of 68.1 Mbp) of acrocentric short arms, 
and these SDs are the largest in the human ge- 
nome (see the figure, panel A). Additionally, 
54% of acrocentric SDs are copy number var- 
iable or map to different chromosomes among 
the six individuals examined. A detailed com- 
parison between the current reference genome 
(GRCh38) and T2T-CHM13 for SD content 
identifies 81 Mbp of previously unresolved or 
structurally variable SDs. Short-read whole- 
genome sequence data from a diversity panel of 
268 humans show that human copy number is 
nine times (59.26 versus 6.55 Mbp) more likely 
to match T2T-CHM13 rather than GRCh38, in- 
cluding 119 protein-coding genes (see the fig- 
ure, panel B). Using long-read-sequencing data 
from 25 human haplotypes, we investigated pat- 
terns of human genetic variation identifying 
significant increases in structural and single- 
nucleotide diversity. We identified gene-rich 
regions (e.g., TBC1D3) that vary by hundreds 
of kilo-base pairs and gene copy number be- 
tween individuals showing some of the high- 
est genome-wide structural heterozygosity (85 
to 90%). Our analysis identified 182 candidate 
protein-coding genes as well as the complete 
sequence for structurally variable gene models 
that were previously unresolved. Among these 
is the complete gene structure of lipoprotein A 
(LPA), including the expanded kringle IV re- 
peat domain. Reduced copies of this domain 
are among the strongest genetic associations 
with cardiovascular disease, especially among 
African Americans, and sequencing of multiple 
human haplotypes identified not only copy 
number variation but also other forms of rare 
coding variation potentially relevant to disease 
risk. Finally, we compared global methylation 
and expression patterns between duplicated 
and unique genes. Transcriptionally inactive 
duplicate genes are more likely to map to hy- 
pomethylated genomic regions; however, spe- 
cifically over the transcription start site we 
observe an increase in methylation, suggest- 
ing that as many as two-thirds of duplicated 
genes are epigenetically silenced. Additionally, 
SD genes show a high degree of concordance 
between methylation profiles and transcrip- 
tion levels, allowing us to define the actively 
transcribed members of high-identity gene 
families that are otherwise indistinguishable 
by coding sequence. 


CONCLUSION: A complete human genome 
provides a more comprehensive under- 
standing of the organization, expression, 
and regulation of duplicated genes. Our 
analysis reveals underappreciated patterns 
of human genetic diversity and suggests 
characteristic features of methylation and 
gene regulation. This resource will serve as 
a critical baseline for improved gene an- 
notation, genotyping, and previously unknown 
associations for some of the most dynamic 
regions of our genome. 
Despite their importance in disease and evolution, highly identical segmental duplications (SDs) 
are among the last regions of the human reference genome (GRCh38) to be fully sequenced. 
Using a complete telomere-to-telomere human genome (T2T-CHM13), we present a comprehensive 
view of human SD organization. SDs account for nearly one-third of the additional sequence, 
increasing the genome-wide estimate from 5.4 to 7.0% [218 million base pairs (Mbp)]. An analysis 
of 268 human genomes shows that 91% of the previously unresolved T2T-CHM13 SD sequence 
(68.3 Mbp) better represents human copy number variation. Comparing long-read assemblies from 
human (n = 12) and nonhuman primate (n = 5) genomes, we systematically reconstruct the evolution 
and structural haplotype diversity of biomedically relevant and duplicated genes. This analysis 
reveals patterns of structural heterozygosity and evolutionary differences in SD organization between 
humans and other primates. 


Genomic duplications have long been 
recognized as important sources of 
structural change and gene innovation 
(1, 2). In humans, the most recent and 
highly identical sequences [>90% and 
>1 kilo-base pairs (kbp)]—referred to as seg- 
mental duplications (SDs) (3)—promote mei- 
otic unequal crossover events that contribute 
to recurrent rearrangements associated with 
~5% of developmental delay and autism (4). 
These same SDs are reservoirs for human- 
specific genes that have been important in 
increasing synaptic density and the expansion 
of the frontal cortex since humans diverged 
from other ape lineages (5-8). SDs are also 
enriched ~10-fold for normal copy number var- 
iation, although most of this genetic diversity 
has yet to be fully characterized or associated 
with human phenotypes (9, 10). SD length 
(frequently >100 kbp), sequence identity, and 
extensive structural diversity among human 
haplotypes have hampered our ability to char- 
acterize these regions at a genomic level. This 
is because sequence reads have been insufficiently
long and human haplotypes too structurally
diverse to resolve duplicate copies or
distinguish allelic variants.

One of the first human whole-genome se- 
quence (WGS) assembly drafts created with 
Sanger sequencing technology was almost 
devoid of SDs and their underlying genes 
(11, 12). Similarly, bacterial artificial chromo- 
some (BAC)-based approaches to assembling 
the human genome from different haplotypes 
led to many misjoins, creating de facto gaps 
that took years to resolve (13). Although combining
WGS- and BAC-based data for early
sequencing of human genomes provided a
roadmap of the SD landscape (14), more than
50% of the gaps within the human reference 
genome have corresponded to regions of com- 
plex SDs. 

The development of genomic resources 
(15-17), including BAC libraries and long-read 
sequence data from complete hydatidiform 
moles (CHM, which represent a single human 
haplotype), was motivated in large part by ef- 
forts to resolve the organization of these regions 
and concomitantly complete the human refer- 
ence genome. The CHM13 cell line has the 
advantage of originating from a single haplotype 
and predominantly a single ancestral group 
(European) (18) in contrast to the GRCh38
reference, which is a composite representa- 
tion of multiple human haplotypes and an- 
cestries (19). These resources, combined with 
advances in long-read technologies, have 
produced the gapless human genome assembly 
T2T-CHM13 (20). We use this genome assembly 
to present a complete view of SDs in a human 
genome and highlight their importance in 
advancing our understanding of genetic diversity,
evolution, and disease in humans.


SD content and organization 


We characterized the SD content of the T2T- 
CHM13 v1.0 assembly by sequence read-depth 
and pairwise sequence alignments (>90% 
and >1 kbp) (21). Our analysis of the assembly
identifies 208 Mbp of nonredundant
segmentally duplicated sequences within 
chromosome-level scaffolds (including 15.6 Mbp 
of SD located on chrY, which is included from 
GRCh38), compared with just 167 Mbp in 
the current reference (GRCh38) (Table 1 and 
Fig. 1). This raises the percent estimate of the 
human genome that is segmentally duplicated 
from 5.4 to 6.7%. However, five SD-related 
gaps remained in the initial assembly of the 
female CHM13 genome (T2T-CHM13 v1.0). Each 
corresponded to a cluster of tandemly repeated 
ribosomal DNA (rDNA) genes on each acrocen- 
tric chromosome where we confirmed long-read 
sequence pileups consistent with unresolved 
SDs. The estimated amount of missing rDNA 
sequence was calculated by Nurk et al. by using 
both digital droplet PCR (22) and a whole- 
genome Illumina coverage analysis (20). Assum- 
ing a canonical repeat length of 45 kbp for 
the rDNA molecule (23, 24), the total amount 
of missing sequences was approximated at 
~10 Mbp and ~200 copies of unresolved rDNA 
sequence (20). These findings are consistent 
with the subsequent specialized assembly of 
the rDNA released as part of the T2T-CHM13 
v1.1 assembly. Including this estimate, the 
overall SD content of the human genome is 
now 7.0% (6.7% not including rDNA; see Table 
1 for statistics breakdown by SD type) and is 
likely to increase as more complete genomes of 
diverse origins are sequenced and assembled. 
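
The percentages quoted above follow from simple arithmetic on the reported totals. The following is a minimal sketch, assuming the 3.1144-Gbp assembly size and the 217.598-Mbp SD total (with rDNA estimate) listed in Table 1; the variable and function names are illustrative and do not come from the original analysis.

```python
# Values quoted in the text and Table 1; names are illustrative only.
GENOME_MBP = 3114.4          # T2T-CHM13 v1.0 assembly size (3.1144 Gbp)
SD_WITH_RDNA_MBP = 217.598   # SD content including the rDNA estimate
RDNA_ESTIMATE_MBP = 10.0     # approximate missing rDNA sequence
RDNA_COPIES = 200            # approximate number of unresolved rDNA copies
RDNA_UNIT_KBP = 45           # canonical rDNA repeat length


def percent(part_mbp, total_mbp):
    """Return part as a percentage of total."""
    return 100.0 * part_mbp / total_mbp


# ~200 copies x 45 kbp is 9 Mbp, in line with the ~10 Mbp approximation.
rdna_mbp = RDNA_COPIES * RDNA_UNIT_KBP / 1000.0

# Subtracting the rDNA estimate recovers the assembled SD content.
sd_without_rdna_mbp = SD_WITH_RDNA_MBP - RDNA_ESTIMATE_MBP

pct_with_rdna = percent(SD_WITH_RDNA_MBP, GENOME_MBP)
pct_without_rdna = percent(sd_without_rdna_mbp, GENOME_MBP)

print(f"rDNA from copies x unit length: {rdna_mbp:.1f} Mbp")
print(f"SD content: {pct_without_rdna:.1f}% without rDNA, "
      f"{pct_with_rdna:.1f}% with rDNA")
```
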
One-third (81.3 Mbp) (25) of SD sequence 
in T2T-CHM13 is wholly uncharacterized in 
GRCh38 (16.5 Mbp) or differs in copy number 
and structure (64.8 Mbp) (25). Most of these 
involve large, high-identity SDs. For example, 
there is a 70% increase (41,285 versus 24,280) 
in the number of SD pairs and a doubling of 
the number of bases in pairwise alignments 
with >95% identity (Fig. 1C). Among these 
previously unresolved or variable SDs, 13,258 
(35.0 Mbp) map to the acrocentric short arms 
of chromosomes 13, 14, 15, 21, and 22 (Fig. 1B 
and Table 1), which are unassembled in
GRCh38. These SDs do not correspond to
rDNA duplications but represent other seg- 
ments predominantly shared among acrocen- 
tric (n = 5332 alignments) and nonacrocentric 
chromosomes (n = 5500 alignments; table S1).
In particular, the pericentromeric regions of 
chromosomes 1, 3, 4, 7, 9, 16, and 20 show the 
most extensive SD homology with acrocentric 
DNA (Fig. 1B). Non-rDNA acrocentric SDs 
are 1.74 times as long as all other SDs (N50: 
74,704 versus 42,842) and significantly longer 
Table 1. Summary statistics of segmental duplications in T2T-CHM13 and GRCh38. Mbp, number of nonredundant Mbp of SD; peri, within 5 Mbp of the
heterochromatin surrounding the centromere; telo, within 500 kbp of the telomere; acro, within the short arms of the acrocentric chromosomes. Difference:
SD content difference between T2T-CHM13 v1.0 and GRCh38. Previously unresolved or structurally variable: Sequence in T2T-CHM13 that does not have 1 Mbp
of synteny with GRCh38. GRCh38 contains 149,690,719 bp of gap sequence included in the reported number of Gbp.


Assembly  Gbp  % SD  SD (Mbp)  # SDs  inter (Mbp)  # inter  intra (Mbp)  # intra  acro (Mbp)  # acro  peri (Mbp)  # peri  telo (Mbp)  # telo

Previously unresolved or structurally variable  0.240  33.885  81.338  25161  61.873  20579  54.932  4582  35.039  13258  5.616  4005

T2T-CHM13 v1.0* + rDNA estimate  3.1144  6.987  217.598  66042  131.148  49213  152.993  16829


*The version of T2T-CHM13 that was used (v1.0) included chrY from GRCh38. 


(P value < 0.01, one-sided Wilcoxon rank-sum 
test) than any other defined SD category in the 
human genome (intrachromosomal, interchro- 
mosomal, pericentromeric, and telomeric; fig. S1). 
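
The N50 values used above to compare SD lengths can be computed as follows. This is a minimal sketch of the standard N50 definition, assuming a plain list of SD lengths in base pairs; the function name and the toy lengths are illustrative, and the rank-sum test itself is not reproduced here.

```python
def n50(lengths):
    """Return the N50 of a collection of segment lengths.

    N50 is the length L such that segments of length >= L together
    cover at least half of the total number of bases.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2.0
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("lengths must not be empty")


# Toy lengths in bp, not data from the study.
toy_sd_lengths = [2000, 3000, 4000, 5000, 6000]
toy_n50 = n50(toy_sd_lengths)
print(toy_n50)

# Ratio of the two N50 values reported in the text (acrocentric vs. other SDs).
n50_ratio = 74704 / 42842
print(round(n50_ratio, 2))
```
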

We annotated all T2T-CHM13 SDs with 
DupMasker (26), which defines ancestral 
evolutionary units of duplication on the basis 
of mammalian outgroups and a repeat graph 
(27). Focusing on duplicons that carry genes or 
duplicated portions of genes, we identified 30 
duplicons that show the greatest copy number 
change between T2T-CHM13 and GRCh38. 
These 30 genic SDs represent regions where 
gene annotation is most likely to change; all 
predicted differences favor an increase in 
copy number for the T2T-CHM13 assembly 
(Fig. 1D and table S2). 

We also compared the number of SDs more 
directly by defining syntenic regions (5 Mbp) 
between GRCh38 and T2T-CHM13 (25). Of
the 15 windows with the largest increase, nine 
mapped to the acrocentric short arms and six 
were in pericentromeric regions (fig. S1 and 
table S3). In particular, the intervals between 
the centromeric satellite and secondary con- 
strictions (qh regions) on chromosomes 1, 9, 
and 16 show a 4.6-fold increase in the number 
of SDs (5254 versus 1141) and the most dif- 
ferences in organization compared with GRCh38. 
SDs in these regions are almost exclusively inter- 
chromosomal and depleted for intrachromo- 
somal duplications (figs. S2 and S3). 


Validation and heteromorphic variation 


Because the acrocentric short arms as well as 
the qh regions on chromosomes 1, 9, and 16
were either previously unresolved or showed 
the most considerable differences in terms of SD 
content, we focused first on validating their 
organization. We mapped available end-sequence 
data from a human fosmid genome library
(28) to the T2T-CHM13 assembly and selected 
nine distinct clones as probes (Fig. 2A) to con- 
firm the patterns of high-identity (>95%) SDs 
(25). All 30 of the distinct duplication predic- 
tions from the T2T-CHM13 SDs were corrobo- 
rated by fluorescence in situ hybridization (FISH) 
against chromosomal metaphases of the CHM13 
cell line (Fig. 2, B and C, and table S4). 

FISH also revealed nine additional signals 
not originally predicted by our SD analysis 
(fig. S4). However, we were able to identify 
lower identity duplications that confirmed 
seven of these sites, leading to an overall con- 
cordance of 95% (37 of 39) between FISH and 
the T2T-CHM13 SD assembly content. We ex- 
tended this analysis to five additional human 
cell lines of diploid origin because both peri- 
centromeric and acrocentric portions of chro- 
mosomes have been shown to be cytogenetically 
heteromorphic (29-31). In total, we identified
61 distinct cytogenetic locations of which 28 
(46%) were fixed whereas 33 (54%) were variable 
in their presence or absence on specific homologs 
(both acrocentric and pericentromeric regions 
of the human genome) (fig. S4). Of the 61 FISH 
signals, all but six were observed in more than 
one of the six human cell lines indicating that 
such heteromorphic variation is common and 
prevalent. 

We found a correlation (Pearson’s correlation 
coefficient, r = 0.96) between genome-wide
copy number variation from the assembly and 
Illumina read-depth data generated from the 
same CHM13 source (25). Because SDs fre- 
quently map to the breakpoints of inversion 
polymorphisms (28, 32, 33), we validated 65 in- 
versions relative to GRCh38 with single-cell DNA 
template strand sequencing (Strand-seq analysis) 
of the T2T-CHM13 assembly (figs. S5 and S6) 
45.141 38017 49738 10.975 4998


(25). Although 32 of these represent known 
human polymorphisms, 33 have not been ob- 
served in six previously analyzed human ge- 
nomes (32). However, by analysis of Strand-seq 
data from one additional human haplotype 
(CHM1), we further confirmed 30 of these 
inversions (i.e., present in CHM1 and CHM13), 
suggesting that at least 95.4% (62 of 65) represent
true large-scale human inversion polymorphisms
(fig. S5). Consistent with previous
literature (34), inversions associated with SDs 
(n = 30) are significantly longer than those not 
associated with SDs (P value < 0.01, one-sided 
Wilcoxon rank-sum test) and are polymorphic 
among humans (fig. S6). One notable example 
is an inversion polymorphism mapping to 
chromosome 1q21. It is a complex event 
consisting of two inversions (262.3 kbp and 
2.26 Mbp) originally predicted by Sanders 
and colleagues (33) but our sequence analy- 
sis shows a relocation of 767.6 kbp of genic 
sequences (Fig. 2D). The large inversion 
(chr1:146,350,000 to 148,610,000) is flanked by 
the core duplicon—the NBPF gene family—and 
in combination with the other rearrangements 
changes the order of human-specific genes 
NOTCH2NLA, -B, and -C, which have been 
implicated in expansion of the frontal cortex 
(8, 35). As a final test, we resolved this region 
in eight additional human haplotypes (25), all 
of which support the T2T-CHM13 configuration
with one exception (CHM1), which is consistent
with the GRCh38 configuration (fig. S7).


Single-nucleotide and copy number variation 
within SDs 


The high quality and single haplotype nature of 
both the T2T-CHM13 and GRCh38 reference 
genomes provides an opportunity to compare 
the genome-wide pattern of single-nucleotide 
Fig. 1. Segmental duplication (SD) content of the T2T-CHM13 genome.
(A) The pattern of previously unresolved or structurally variant intrachromosomal
duplications in T2T-CHM13 (red) compared with known duplications in GRCh38
(blue-gray). These predict hotspots of genomic instability (gold) flanked by large
(>10 kbp), high-identity (>95%), interspersed (>50 kbp) SDs. (B) Circos plot
highlighting previously unresolved interchromosomal SDs (red) and showing the
preponderance of previously unresolved SDs mapping to pericentromeric and
acrocentric regions. (C) A histogram comparing SD content in different human
reference genomes, including the sum of bases in pairwise SD alignments
stratified by their percent identity for the Celera (yellow, Sanger-based); GRCh38
(blue-gray, BAC-based); and T2T-CHM13 (red, long-read) assemblies. (D) The
30 genic duplicons (ancestral repeat units) with the greatest copy number
difference between GRCh38 and T2T-CHM13 as determined by DupMasker
(table S2). The 30 largest differences all exhibit an increase in T2T-CHM13.


variation in regions that have typically been 
excluded from most analyses because of their 
repetitive nature. We aligned GRCh38 to T2T- 
CHM13 and retained only regions deemed to 
be “syntenic” on the basis of an unambiguous 
one-to-one correspondence between both ref- 
erence genomes and at least 1 Mbp of aligned 
sequence (25). 

Most unique regions of the genome 
(2693 Mbp) could be compared, whereas only 
60% (124 Mbp) of the SDs within T2T-CHM13
exhibited a clear orthologous relationship be- 
tween the two human reference genomes. As 
expected, the X chromosome and the region 
corresponding to the major histocompatibility 
complex (MHC) are the least and most divergent,
respectively (Fig. 3A), as a result of the
slower rate of evolution for the female X and 
the deep coalescence of MHC. 

Notably, SD sequences are significantly more 
diverged than unique sequences (P value < 
0.001, one-sided Mann-Whitney U test) (fig. 
S8). Comparing only syntenic regions (25) be- 
tween GRCh38 and T2T-CHM13, we estimate 
the single-nucleotide variant (SNV) density to 
be 0.95 SNVs/kbp for unique regions of the ge- 
nome when compared with SD regions where 
density rises to 1.47 SNVs per kbp (table S5). This
50% increase could be a result of an increased 
mutation rate of SDs (e.g., a result of the action 
of interlocus gene conversion), or a deeper aver- 
age coalescence of duplicated sequences. Another 
possible explanation for this observation is
erroneous alignment of paralogous instead 
of allelic sequences; however, we believe this 
is unlikely given the requirement of at least 
1 Mbp of continuous, one-to-one, best alignment 
between GRCh38 and T2T-CHM13 (25). 
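
The density comparison above reduces to a per-kilobase rate and a relative difference. The sketch below is illustrative only: the function names and the example SNV count are assumptions, and only the two densities (0.95 and 1.47 SNVs per kbp) are taken from the text.

```python
def snv_density_per_kbp(snv_count, aligned_bp):
    """Return SNVs per kilobase pair of aligned sequence."""
    if aligned_bp <= 0:
        raise ValueError("aligned_bp must be positive")
    return snv_count / (aligned_bp / 1000.0)


def relative_increase(baseline, value):
    """Return the fractional increase of value over baseline."""
    return (value - baseline) / baseline


# Densities reported in the text (SNVs per kbp).
UNIQUE_DENSITY = 0.95
SD_DENSITY = 1.47

increase = relative_increase(UNIQUE_DENSITY, SD_DENSITY)
print(f"SD regions show a {100 * increase:.0f}% higher SNV density")

# Hypothetical example: 1470 SNVs across 1 Mbp of syntenic SD alignment.
example_density = snv_density_per_kbp(1470, 1_000_000)
print(example_density)
```

Computed this way, the relative increase is about 55%, which is roughly in line with the approximately 50% increase cited in the text.
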

As part of this analysis, we also identified 
regions that structurally differ or are absent 
from GRCh38 when compared with the T2T- 
CHM13 assembly. Using 1-Mbp LASTZ align- 
ments (25), we identified 126 nonsyntenic 
regions for a total of 240 Mbp (N50 length of 
12.7 Mbp; fig. S9). Of these, 33.9% (81.34 of 
240 Mbp) overlapped SD regions. Using sequence 
read depth (25) from 268 human genomes 
[Simons Genome Diversity Project (SGDP)], 
Fig. 2. Validation of previously unresolved SDs in T2T-CHM13 and heteromorphic
variation. (A) Ideogram (top) depicts large SD regions (light red boxes)
present in T2T-CHM13 but absent from the current reference human genome
(GRCh38). An expanded view of the duplication (red) and satellite organization
(blue) are depicted below showing the location of fosmid FISH probes (e.g., C15)
and SD organization compared with ancestral duplicon segments (inset,
multicolored bars). (B and C) FISH signals (red) shown on extracted metaphase
for two probes and three human cell lines. Probe K20 shows a fixed signal
(except for one heterozygous signal), and G6 is heteromorphic among
humans (see table S4 and fig. S4 for complete descriptions of all nine probes).
(D) Inversion polymorphism (green bar) between T2T-CHM13 and GRCh38
in the pericentromeric chromosome 1q region. The inversion (green bar)
identified by Strand-seq (32) is confirmed in the assembly; however, the
sequence-resolved assembly shows a more complex structure including two
inversions (red) and one reordered segment (blue) mapping near the
NOTCH2NL human-specific duplications.


we compared the copy number of both T2T- 
CHM13 and GRCh38 (36), successfully gen- 
otyping 1292 distinct copy number variable 
regions (74.85 Mbp). We found that T2T-CHM13 
approximated the median (±2 SD) human
copy number from SGDP for 94% of bases 
(70.6 Mbp) in contrast to GRCh38 in which 
57% of bases (42.8 Mbp) meet this metric 
(fig. S10). In particular, the copy number for 
humans is nine times (59.26 versus 6.55 Mbp) 
more likely to more closely match the T2T- 
CHM13 copy number than that of GRCh38 in 
nonsyntenic SD regions (Fig. 3B). Thus, T2T- 
CHM13 is a better predictor (AUC 0.91) than 
GRCh38 (AUC 0.77) of human copy number 
variation and better approximates an in silico 
human reference constructed with the median 


Vollger et al., Science 376, eabj6965 (2022) 


copy number of the SGDP samples at every 
site (AUC 0.96; Fig. 3C). In nonsyntenic SD 
regions, GRCh38 tends to underestimate nor- 
mal human copy number by an average of 9.2 
copies or a median of 3.0 copies. 
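
One way to picture the AUC comparison above is as the area under a cumulative curve: for each allowed copy number difference, the fraction of sample genotypes that fall within that distance of the reference. The sketch below is an assumption about how such a curve could be summarized, not the procedure from the supplementary methods (25); the function names, the normalization, and the toy data are all hypothetical.

```python
def fraction_within(sample_cns, reference_cns, allowed_diff):
    """Fraction of (sample, site) pairs whose copy number is within
    allowed_diff of the reference copy number at that site."""
    hits = 0
    total = 0
    for sample in sample_cns:
        for cn, ref in zip(sample, reference_cns):
            total += 1
            if abs(cn - ref) <= allowed_diff:
                hits += 1
    return hits / total


def normalized_auc(sample_cns, reference_cns, max_diff=30):
    """Trapezoidal area under the cumulative curve for allowed differences
    0..max_diff, scaled so that a perfect reference scores 1.0."""
    ys = [fraction_within(sample_cns, reference_cns, d)
          for d in range(max_diff + 1)]
    area = sum((ys[i] + ys[i + 1]) / 2.0 for i in range(max_diff))
    return area / max_diff


# Toy data: three samples genotyped at four sites (not SGDP data).
samples = [[2, 4, 10, 6], [2, 5, 12, 6], [3, 4, 11, 7]]
ref_close = [2, 4, 11, 6]      # reference near the sample copy numbers
ref_collapsed = [2, 2, 2, 2]   # reference with collapsed duplications

auc_close = normalized_auc(samples, ref_close)
auc_collapsed = normalized_auc(samples, ref_collapsed)
print(auc_close, auc_collapsed)
```

Under this toy setup, a reference that collapses duplicated sequence scores lower than one that tracks the sample copy numbers, mirroring the direction of the difference reported between GRCh38 and T2T-CHM13.
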

We identify 119 protein-encoding genes 
where the T2T-CHM13 copy number better rep- 
resents the true human copy number state. 
By contrast, only 65 genes are better repre- 
sented by GRCh38 (table S6). These include 
both biomedically important genes relevant to 
disease risk (LPA and MUC3A) (37-43) as well 
as gene families implicated in the expansion 
of the human brain during human evolution 
(TBC1D3, NPIP, and NBPF) (Fig. 3D and table 
S6) (7, 44-46). In T2T-CHM13 for example, 
there are additional copies of NPIP, NBPF, and 
GOLGA that are absent from GRCh38, and
each of these has been described as a core 
duplicon responsible for the expansion of in- 
terspersed duplications in the human genome 
(27) as well as the emergence of human- 
specific gene families. 

Notably, African genomes tend to have an 
overall higher copy number status when compared
with non-African genomes. In particular,
TBC1D3 shows about seven fewer copies in
non-Africans when compared with Africans 
(P value < 1 x 10~"”). These findings suggest 
that higher copy number is likely ancestral 
(table S7) and that T2T-CHM13 once again better 
captures that diversity. Despite its primarily 
European origin, our results show that the more 
complete genome assembly serves as a better 




RESEARCH | COMPLETING THE HUMAN GENOME 
[Fig. 3 graphic. Panel A: % divergence of 10-kbp windows (1-kbp slide) aligned from GRCh38 to T2T-CHM13 for non-SD, SD, chrX, and MHC sequence (non-SD < SDs, P value < 1e-04). Panel B: Mbp of sequence supporting T2T-CHM13, equal CN, GRCh38, or neither. Panel C: allowed CN difference between the sample and the reference; AUC: T2T-CHM13, 0.91; GRCh38, 0.77; median CN of SGDP, 0.96. Panel D: reference CN and number of African and non-African SGDP samples per gene, including GPRIN2 (chr10:48,671,764-48,719,220), NPIP (chr16:21,340,301-21,356,195), and TBC1D3 (chr17:38,963,566-38,978,236).]


Fig. 3. SD single-nucleotide and copy number variation. (A) Sequence
... on the basis of syntenic alignments between
... (red), and unique genomic regions (black). SD
... chromosome X (blue) but less than that of the MHC
regions (green). (B) Copy number of SD regions that were previously unresolved
or structurally different in T2T-CHM13 compared with GRCh38 on the basis of
268 human genomes from the Simons Genome Diversity Project (SGDP). The
histogram shows the number of Mbp, in which more samples support the copy
number of the given assembly [T2T-CHM13 (red), GRCh38 (blue), neither
(green), or both equally (equal copy number, gray)]. (C) Empirical cumulative
... or T2T-CHM13 as a function of the allowed difference between sample and
reference copy number. The inset shows the area under the curve (AUC)
calculation for both references allowing a maximum copy number difference of
30. The green curve shows an in silico reference made using the median copy


reference for copy number variation irrespective 
of population group (fig. S11). 


Structural variation and massive evolutionary 
changes in the human lineage 


Advances in long-read genome assembly (47, 48) 
enable sequence resolution of complex struc- 
tural variation associated with SDs at the 
haplotype level (49). We generated or used 
existing high-fidelity (HiFi) sequence data 
from 12 human and five nonhuman primate 
genomes to understand both the structural 
diversity and evolution of specific SD regions. 
To guide the selection of candidate regions for
analysis, we constructed a hifiasm assembly 
of a chimpanzee genome (note that the cell 
line used to make the sequencing data for 
this genome assembly was originally made 
from a now deceased chimpanzee individ- 
ual, Clint), compared it with the T2T-CHM13 
assembly, and searched for regions of substan- 
tial structural difference between the lineages. 
We focused first on the largest regions of in- 
sertion on the human lineage before sub- 
selecting those regions that contain genes of 
biomedical or evolutionary importance (tables 
number of the SGDP samples at each site. (D) Genic copy number variation.
Copy number variation in nine gene families is shown (generated with SGDP),
and the distribution is colored according to which reference better reflects the
median copy number; GRCh38 generally underestimates copy number (vertical
lines) and Africans (orange) tend to show higher copy number than non-Africans
(blue); circle size indicates number of samples.


S8 and S9). We restricted the analysis to in- 
sertions >50 kbp in length and selected 10 loci 
for a more detailed analysis, including genes 
associated with the expansion of the human 
frontal cortex (tables S8 and S9 and fig. S12). 
Assemblies of additional haplotypes recapitu- 
lated the structural organization of T2T-CHM13 
for eight of the 10 loci, whereas evidence for 
the structural organization of GRCh38 was only 
found in five of the 10 loci (25). Overall, 73% of 
the human haplotype assemblies were success- 
fully reconstructed (table S8); however, the 
fraction of human haplotypes resolved at each 
Fig. 4. Human-specific expansion of TBC1D3 compared with nonhuman
primates. (A) Regions of homology between human T2T-CHM13's chromosome
17 (top) and a HiFi assembly of the chimpanzee genome (bottom). Red blocks
represent regions of human-specific expansion, including TBC1D3 duplications.
Colored arrows above and below the homologous sequence represent
unique ancestral units (duplicons) identified by DupMasker. Inset plots for
both expansion sites are included below with the gene models identified by
Liftoff (94). (B) Copy number (diploid) estimates from an Illumina read-depth
analysis of SGDP, ancient hominids, and nonhuman primates for a TBC1D3
paralog (table S14). Copy number estimates include five pseudogenes not
included in the phylogeny, explaining the higher counts observed. The
T2T-CHM13 copy number and GRCh38 copy number are represented by red
and blue lines, respectively. (C) Phylogeny of TBC1D3 copies at these two
expansion sites as well as nonhuman primate copies. Single asterisks at
nodes indicate bootstrap values ≥70% whereas double asterisks indicate
100%. The data illustrate a human-specific expansion as well as several
independent expansions in macaques, gorillas, and orangutans. Using a
macaque sequence as an outgroup, we estimate the human-specific
expansion to be ~2.3 million years ago (MYA). (D) Variation in human haplotypes
across the first TBC1D3 expansion site: a graph representation (rGFA, left)
of the locus where colors indicate the source genome for the sequence, and
on the right the path for each haplotype-resolved assembly through the graph.
The top row for each haplotype composed of large polygons represents an
alignment comparing the haplotype-resolved sequence (horizontal) against the
graph (vertical), and color represents the source haplotype for the vertical
sequence. For example, a single large red triangle indicates there is a one-to-one
alignment between T2T-CHM13 and the haplotype. Structural variants can be
identified from discontinuities in height (deletion), changes between colors
(insertion), or changes in the direction of the polygon (inversion). Below is
shown the gene of interest (red arrow) and other genic content in the region
(black arrow). Colored bars show ancestral duplication segments (duplicons) that
compose the larger duplication blocks.


locus varied considerably depending on the 
size and complexity of the region (fig. S13). For
example, in the case of the 8.9-Mbp region cor- 
responding to NOTCH2NL and SRGAP2B/2D, 
we recovered only 37.5% of human haplotypes 
(table S8 and fig. S7). Similarly, we resolved 
only six haplotypes (from a potential of 24 
haplotypes) for the 3.4-Mbp region harboring 
the SMN1 and SMN2 loci (fig. S14).

Among the haplotypes that could be re- 
solved, we found a high degree of structural 
heterozygosity among human genomes (25) 


with 249 kbp differing on average when compared
with T2T-CHM13 (table S10). In some
cases the structural changes are simple, such 
as a ~12-kbp insertion or deletion of CYP2D6,
which contributes to differential drug metab- 
olism activity in addition to other human dis- 
ease susceptibilities (50-56) (fig. S15). In other 
cases the patterns of structural variation are 
complex, involving hundreds of kilobase pairs 
of inserted or deleted gene-rich sequences along 
with large-scale inversion events that alter 
gene order for specific human haplotypes (see 
ARHGAP11A/B; fig. S16, and NOTCH2NLA/B;
fig. S7). Furthermore, the spinal muscular atrophy 
(SMA) locus containing SMN1 and SMN2—
one of the most difficult regions to finish as part 
of the Human Genome Project on chromosome 
5 (57)—shows a distinct structure for all seven 
assembled haplotypes including GRCh38. Some 
haplotypes not only show increases in SMN2 
copy number (fig. S14)—a genetic modifier of
SMA (58)—but also potential functional differ- 
ences in the organization and composition of 
SMN2. Because SMN2 serves as a target for 
small-molecule drug therapy that improves 
splice-site efficiency compensating for the 


loss of SMN1 in SMA patients (59), this level of
sequence resolution suggests practical utility
for disease risk assessment and treatment of 
patients. 

Of particular interest is the TBC1D3 gene
family (44) (Fig. 4 and figs. S17 and S18), the 
protein products of which modulate epidermal 
growth factor receptor signaling and trafficking 
(60); further, their duplication in humans has 
been associated with expansion of the human 
prefrontal cortex as evidenced by mouse 
transgenic experiments (7). A comparison with 
chimpanzees (Fig. 4A) shows two massive 
genomic expansions in the human lineage 
(323.0 and 124.4 kbp). Both the high sequence 
identity (99.6%) and sequence read-depth comparisons
of TBC1D3 copy number are consistent
with expansion occurring in the human
lineage after divergence from chimpanzees 
(Fig. 4B). 

We extended this analysis to other non- 
human primates by generating HiFi assem- 
blies for bonobos, gorillas, orangutans, and 
macaques. We identified TBC1D3 homologs
in each species and constructed a maximum 
likelihood phylogeny by using intronic or 
noncoding sequences flanking the gene (Fig. 
4C). The analysis reveals recurrent and inde- 
pendent expansions of TBC1D3 in orangutan, 
gorilla, and macaque species at different time 
points during primate evolution, with the most 
recent expansions occurring 2 million and 
2.6 million years ago. However, these estimates 
assume that there has not been substantial 
interlocus gene conversion, which may not be 
the case. 

Complete sequencing of human TBC1D3
haplotypes reveals notable structural diversity 
(Fig. 4D) with TBC1D3 copy number ranging
from three to 14 copies at expansion
site 1 and two to nine copies at expansion site 
2. In total, approximately one-third of human 
expansion site 2 shows large-scale structural 
variation, and we identify >1.8 Mbp of dupli- 
cated sequence and >650 kbp of inverted 
sequence across the 18 haplotypes (including 
GRCh38). We estimate the structural hetero- 
zygosity of this locus to be 90.1% with 14 of 18 
haplotypes showing structurally distinct dup- 
lication configurations (fig. S18). Similarly, 
TBC1D3 expansion site 1 is 87.6% heterozygous
with 14 of 22 haplotypes displaying unique 
structures corresponding to copy number differences
in the TBC1D3 gene family (fig. S17).
Using orthogonal Oxford Nanopore Technol- 
ogies (ONT) ultra-long-read sequencing, we 
validated these complex patterns of structural 
variation in a subset of the samples inves- 
tigated here (25) (figs. S19 and S20). To bet- 
ter represent the structural genetic variation 
at this locus, we used a graph-based representation
(61), which identified two TBC1D3
genes as common among all human haplotypes
examined thus far (TBC1D3B at site 1 and
TBC1D3A at site 2).
Additional gene models and variable
duplicate genes

We identified 182 candidate genes that were 
previously unresolved or nonsyntenic (25) in 
the T2T-CHM13 genome assembly (compared 
with GRCh38) with open reading frames and 
multiple exons (table S11). Of these, 91% (166) 
corresponded to SD gene families (Fig. 5A). 
Many of these represent expanded tandem 
duplications (e.g., GAGE gene family members 
on the X chromosome) or large interspersed 
duplications (e.g., beta-defensin locus) adding 
additional copies of nearly identical genes to 
the human genome (Fig. 5A). 

We searched for evidence that these copy 
number polymorphic or structurally variant 
regions were transcribed by aligning long- 
read transcript sequencing data and search- 
ing for perfect matches (25). We constructed 
a database of 44.2 million full-length cDNA 
transcripts derived from 31 human tissue 
samples and compared them with both the 
GRCh38 and T2T-CHM13 human genome re- 
ferences. For those 182 previously unresolved 
protein-coding genes where an unambiguous 
assignment could be made, 36% (65 of 182, 
>20 Iso-Seq reads) were confirmed as expressed, 
and 23 of them showed that more reads mapped 
better to T2T-CHM13 when compared with 
GRCh38 (Fig. 5B). 
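
The per-gene comparison of read support between the two references can be sketched as a simple tally. This is an illustrative example only: it assumes each full-length read has already been aligned to both references and assigned a percent identity against each, and the function name, the `min_gain` parameter, and the toy reads are hypothetical rather than part of the original pipeline.

```python
from collections import Counter


def tally_better_reference(read_identities, min_gain=0.0):
    """Count, per gene, the reads that align better to each reference.

    read_identities: iterable of (gene, identity_t2t, identity_grch38)
    tuples with identities in percent. A read counts toward a reference
    only if its identity there exceeds the other by more than min_gain
    percentage points; otherwise it is counted as a tie.
    """
    counts = {}
    for gene, id_t2t, id_grch38 in read_identities:
        per_gene = counts.setdefault(gene, Counter())
        if id_t2t - id_grch38 > min_gain:
            per_gene["T2T-CHM13"] += 1
        elif id_grch38 - id_t2t > min_gain:
            per_gene["GRCh38"] += 1
        else:
            per_gene["tie"] += 1
    return counts


# Hypothetical reads; gene labels and identities are made up for illustration.
reads = [
    ("geneA", 100.0, 99.1),
    ("geneA", 99.9, 99.9),
    ("geneB", 98.7, 99.6),
]
result = tally_better_reference(reads)
print({gene: dict(per_gene) for gene, per_gene in result.items()})
```

Raising `min_gain` makes the tally more conservative by treating small identity differences as ties.
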

Overall, across the entire genome 12% of 
full-length transcripts exhibited at least 0.2% 
higher alignment identity when mapped against 
T2T-CHM13, whereas 8% aligned better to 
GRCh38. These results are consistent with 
the notion that T2T-CHM13 is more complete, 
but that in some cases both assemblies capture 
different structurally variant haplotypes asso- 
ciated with genes. In addition to entirely new 
genes, we identified several gene models that 
were previously incomplete—many of which 
encode proteins with large tandem repeat do- 
mains (ZNF, LPA, Mucin; Fig. 5C). Among these 
is the complete gene structure of the kringle IV 
domain of the lipoprotein A gene. Reduced 
copies of this domain have some of the strongest 
genetic associations with cardiovascular disease, 
especially among African Americans (37-40, 62). 
Sequencing of multiple human haplotypes not 
only identified length variation but also other 
forms of rare coding variants potentially relevant 
for disease risk (Fig. 5D). 
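
As an illustration of the identity comparison reported above (12% of transcripts aligning at least 0.2% better to T2T-CHM13 versus 8% to GRCh38), the tally can be sketched in Python as follows. The function names, the input layout (one pair of percent identities per read), and the definition of identity as matches over all aligned columns are assumptions made for illustration; they are not taken from the analysis pipeline (25).

```python
def percent_identity(matches, mismatches, insertions, deletions):
    """Percent identity of one alignment: matches over all aligned columns."""
    aligned = matches + mismatches + insertions + deletions
    return 100.0 * matches / aligned if aligned else 0.0


def summarize_differential(identity_pairs, threshold=0.2):
    """Tally reads by which reference they align to better.

    identity_pairs: iterable of (identity_t2t, identity_grch38) in percent,
    one pair per full-length transcript (hypothetical input layout).
    threshold: minimum difference in percent identity to call a read "better".
    """
    better_t2t = better_grch38 = total = 0
    for id_t2t, id_grch38 in identity_pairs:
        total += 1
        diff = id_t2t - id_grch38
        if diff >= threshold:
            better_t2t += 1
        elif -diff >= threshold:
            better_grch38 += 1
    if total == 0:
        raise ValueError("no reads supplied")
    similar = total - better_t2t - better_grch38
    return {
        "better_t2t": better_t2t / total,
        "better_grch38": better_grch38 / total,
        "similar": similar / total,
    }
```

Reads whose identities differ by less than the threshold are counted as similar, which is why the two "better" fractions need not sum to one.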


SD methylation and transcription 


Because methylation is an important consid- 
eration in regulating gene transcription, we 
took advantage of the signal inherent in ultra- 
long ONT data (63-65) to investigate the CpG 
methylation status of SD genes within the 
CHM13 genome (25). Using hierarchical cluster- 
ing, we found that SD blocks are generally either 
methylated or unmethylated as an entire block
(fig. S21 and Fig. 6A). Specifically, we found that
452 SD blocks (127.7 Mbp) flanked by unique
sequences are hypermethylated in contrast
to 222 hypomethylated SD blocks (52.1 Mbp). 
Methylation status does not appear to be 
driven by genomic location, e.g., proximity to 
centromeres, acrocentric short arms, or telo- 
meres (Fig. 6A). 
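
The block-level summary above (counts and total Mbp of hypermethylated versus hypomethylated SD blocks) can be sketched as follows. This is a simplified stand-in: it labels each block with a fixed cutoff on mean CpG methylation, whereas the analysis in the text used hierarchical clustering. The `SDBlock` type, the 0.5 cutoff, and the field names are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SDBlock:
    name: str
    length_bp: int
    cpg_methylation: List[float]  # per-CpG methylation frequency in [0, 1]


def classify_blocks(blocks, cutoff=0.5):
    """Label each block by mean methylation; report block counts and Mbp.

    A fixed cutoff is an assumption standing in for hierarchical clustering.
    Blocks without methylation calls are skipped.
    """
    totals = {"hypermethylated": [0, 0], "hypomethylated": [0, 0]}
    for block in blocks:
        if not block.cpg_methylation:
            continue
        mean = sum(block.cpg_methylation) / len(block.cpg_methylation)
        label = "hypermethylated" if mean >= cutoff else "hypomethylated"
        totals[label][0] += 1
        totals[label][1] += block.length_bp
    return {
        label: {"blocks": count, "mbp": bases / 1e6}
        for label, (count, bases) in totals.items()
    }
```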

Using full-length transcript data from CHM13, 
we compared methylation and transcription 
status of duplicated genes (25). If we stratify 
genes by their number of full-length transcripts, 
we observe distinct methylation patterns for 
transcribed and nontranscribed SD genes (Fig. 
6B). For highly transcribed SD genes (genes
with at least one exon overlapping with SD
sequence) and unique genes, the gene body
and flanking sequence are generally hyper- 
methylated with a pronounced dip near the 
transcription start site (TSS) and promoter (66). 
By contrast, nontranscribed genes show mod- 
erate to low methylation across the gene body 
and flanking sequence. 

Restricting the analysis to genes mapping 
within SDs, we find that transcriptionally si- 
lenced duplicate genes are more likely (10,000 
permutations, P = 0.0018) to map to hypo- 
methylated regions of SD sequences (Fig. 6A) 
when compared with transcribed duplicate 
genes. Additionally, in untranscribed SD genes, 
we observe a statistically significant (P value < 
0.001, one-sided Mann-Whitney U test) in- 
crease in TSS methylation (6.6% increase) 
when compared with unique genes where the 
TSS is more likely to be depleted for methyl- 
ation (8.2% decrease). 
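
A permutation test of the kind cited above (10,000 permutations) can be sketched as follows. The test statistic used here, the difference in the fraction of hypomethylated genes between silenced and transcribed duplicates, is an assumption for illustration; the statistic actually used is described in the supplementary materials (25). Function and parameter names are hypothetical.

```python
import random


def permutation_pvalue(hypo_silenced, hypo_transcribed, n_perm=10_000, seed=0):
    """One-sided permutation P value.

    hypo_silenced / hypo_transcribed: non-empty lists of booleans, True when
    a gene maps to a hypomethylated SD region. Tests whether silenced genes
    are hypomethylated more often than transcribed genes.
    """
    rng = random.Random(seed)
    pooled = list(hypo_silenced) + list(hypo_transcribed)
    n_silenced = len(hypo_silenced)

    def statistic(values):
        first, second = values[:n_silenced], values[n_silenced:]
        return sum(first) / len(first) - sum(second) / len(second)

    observed = statistic(pooled)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # reassign group labels at random
        if statistic(pooled) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_perm + 1)
```

With the add-one correction, the smallest P value obtainable from 10,000 permutations is roughly 0.0001.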

One important consideration in this analysis 
is the presence of a CpG island within 1500 bp 
of the promoter (67). In our analysis of CHM13, 
for example, unexpressed unique genes have a 
low CpG count, consistent with a lack of CpG 
islands (fig. S22). If we repeat the same analysis 
on SD genes, we find that the unexpressed SD 
genes exist with and without CpG islands 
(fig. S22). In total, these observations suggest a 
process of epigenetic silencing for a subset of 
duplicate genes through general demethylation 
of the gene body but hypermethylation of 
promoter regions. On the basis of these ob- 
served signatures, we investigated whether these 
epigenetic features coincided with actively tran- 
scribed members of duplicate gene families. 
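
The CpG content near a promoter, which the analysis above uses as a proxy for the presence of a CpG island, can be summarized as in the sketch below. The 1500-bp flank follows the text; the helper names and the use of an observed/expected CpG ratio are assumptions for illustration, not the method used in the study.

```python
def promoter_window(chrom_seq, tss, flank=1500):
    """Return the sequence within `flank` bp on either side of a 0-based TSS."""
    return chrom_seq[max(0, tss - flank): tss + flank]


def cpg_stats(seq):
    """Return (CpG count, GC fraction, observed/expected CpG ratio).

    Observed/expected = (#CpG * length) / (#C * #G), a commonly used
    CpG island statistic.
    """
    seq = seq.upper()
    n = len(seq)
    c, g = seq.count("C"), seq.count("G")
    cpg = seq.count("CG")
    gc_fraction = (c + g) / n if n else 0.0
    obs_exp = cpg * n / (c * g) if c and g else 0.0
    return cpg, gc_fraction, obs_exp
```

A low CpG count together with a low observed/expected ratio in this window is what one would expect for a promoter lacking a CpG island.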

We investigated a recently duplicated hominid
gene family (NPIPA) (68) where sufficient
paralogous sequence differences exist to unam- 
biguously assign full-length transcripts to spe- 
cific loci. Although promoter and TSS signatures 
are less evident at the individual gene level, the 
gene body methylation signal appears diagnostic
(Fig. 6C). NPIPA1 and NPIPA9, for
example, are the most transcriptionally active 
and show demonstrably distinct methylation 
patterns providing an epigenetic signature to 
distinguish transcriptionally active loci asso- 
ciated with high-identity gene families that 
are otherwise largely indistinguishable. We
show that this trend also holds for other gene
families with high copy numbers (fig. S23).


[Figure 5 (image not reproduced). Panels: (A) ideogram of gene models; (B) differential percent identity of cDNA alignment to T2T-CHM13 vs. GRCh38; (C) LPA gene models by haplotype, with kringle IV copy number and CVD risk; (D) kringle IV amino acid alignment for HG03125 (paternal).]

Fig. 5. Genic variation in previously unresolved SD regions of T2T-CHM13.
(A) Ideogram showing the previously unresolved or nonsyntenic gene models
[open reading frames (ORFs) with >200 bp of coding sequence and multiple
exons] in the T2T-CHM13 assembly as predicted by Liftoff. Previously unresolved
genes mapping to SDs (red) are indicated with an asterisk if predicted to be an
expansion in the gene family relative to GRCh38 (25). Arrows indicate inverted
regions. Most unique genes mapping to nonsyntenic regions (black) are the
result of an inversion (arrow). (B) Percent improvement in mapping of CHM13
Iso-Seq reads in candidate duplicated genes (red) mapping to nonsyntenic
regions of the T2T-CHM13 assembly. Positive values identify Iso-Seq reads
aligning better to T2T-CHM13 than GRCh38. (C) Gene models of LPA with ORF
generated from haplotype-resolved HiFi assemblies. The double-exon repeat in
these gene models encodes the kringle IV subtype 2 domain of the LPA
protein. Highlighted in red are haplotypes with reduced kringle IV subtype 2
repeats predicted to increase risk of cardiovascular disease. (D) Amino acid
variation in the kringle IV subtype 2 repeat in the paternal haplotype of HG03125
identifies a previously unknown set of amino acid substitutions including rare
variants: Ser42Leu in the active site, Ser24Tyr, and Tyr49Cys.


Discussion


This work provides a comprehensive view of
the organization of SDs in the human genome.
The T2T-CHM13 reference adds a chromosome’s
worth (81 Mbp) of SDs, increasing the
human genome average from 5.4 to 7.0%
and nearly doubling the number of SD pairwise
relationships (24 thousand versus
41 thousand) and thus predicts regions of
genomic instability as a result of their potential
to drive unequal crossing-over events
during meiosis.

By every metric, T2T-CHM13 improves our
representation of the structure of the human
genome. This includes sequence-based organization
of the short arms of chromosomes
13, 14, 15, 21, and 22, where we find that SDs
account for more sequence (34.6 Mbp) than
either heterochromatic satellite (26.7 Mbp) or
rDNA (10 Mbp). Acrocentric SDs are almost
twice as large when compared with nonacrocentric
regions likely because of ectopic
exchange events occurring among the short
arms, which associate more frequently during
the formation of the nucleolus (69).

Notably, nearly half of the acrocentric SDs
involve duplications with nonacrocentric pericentromeric
regions of chromosomes 1, 3, 4, 7,
9, 16, and 20. These duplicated islands of
euchromatic-like sequences within acrocentric
DNA are much more extensive than previously
thought but have been shown to be transcriptionally
active (70). Although the underlying
mechanism for their formation is unknown, it
is noteworthy that three of the nonacrocentric
regions have large secondary constriction sites
(chromosomes 1q, 9q, and 16q) composed
almost entirely of heterochromatic satellites
(HSAT2 and HSAT3) (fig. S2). These particular
SD blocks thus are bracketed by large tracts of
heterochromatic satellites, and such configurations
may make them particularly prone to
double-strand breakage events (77) promoting
interchromosomal duplications (fig. S3) between
acrocentric and nonacrocentric chromosomes.

The T2T-CHM13 reference, along with re- 
sources from other human genomes, provides 
a baseline for investigating more complex 
forms of human genetic variation. For exam- 
ple, this complete reference sequence facili- 
tates the design of sequence-anchored probes 
to systematically discover and characterize SD 
heteromorphic variation where chromosome 
organization differs among individuals (Fig. 2). 
Such chromosomal heteromorphisms have been 
traditionally investigated cytogenetically and are 
thought to be clinically benign (29-37). However,
recent work indicates that these large-scale
variants are associated with infertility through
increasing sperm aneuploidy, decreasing rates
of embryonic cleavage (in vitro fertilization,
IVF), and increasing miscarriages (72-79).
Distinguishing between fixed and heteromorphic
acrocentric SDs will facilitate such research
as well as the characterization of breakpoints
associated with Robertsonian translocations,
the most common form of human translocation (80).

At a finer-grained level, the T2T-CHM13
reference and the use of long reads from other
human genomes provide access to other complex
forms of variation involving duplicated
gene families. Short-read copy number variation
analyses and single-nucleotide polymorphism
microarrays have long predicted that
SDs are enriched 10-fold for copy number variation,
but the structural differences underlying
these regions, as well as their functional
consequences, have remained elusive (10, 87).
We reveal elevated levels of human genetic variation
in genes important for human neurodevelopment
(TBC1D3) and disease (LPA, SMN).
Even between just two genomes (GRCh38
and CHM13), we find that 37% (81 Mbp) of
SD bases are uncharacterized or structurally
variable, and that this predicts 182 copy
number variable genes between two human
haplotypes (table S6).


RESEARCH | COMPLETING THE HUMAN GENOME

[Figure 6 (image not reproduced). Panel A: ideogram of methylated and unmethylated SD blocks with centromeres, plus a histogram over average methylation. Panel B: median methylation and quartiles for SD and unique genes along the normalized gene body (-10 kbp, promoter/TSS, TTS, +10 kbp), shown for untranscribed [0,1) and highly transcribed (>10) genes. Panel C: per-CpG methylation calls and a rolling mean (n = 10) from TSS to TTS across NPIPA paralogs.]

Fig. 6. SD methylation and gene transcription. (A) Methylated (red) or
unmethylated (blue-gray) SD blocks in the CHM13 genome on the basis of
processing ONT data. The histogram shows the distribution of average
methylation across these regions. (B) Median methylation signal of SD (red) and
unique (blue-gray) genes stratified by their Iso-Seq expression levels in CHM13.
The filled intervals represent the 25th and 75th percentiles of the observed data.
Vertical lines indicate the position of the transcription start site (TSS) and the
transcription termination site (TTS). (C) Methylation signal across the recently
duplicated NPIPA gene family in CHM13, showing increased methylation in
transcriptionally active copies. Black points are individual methylation calls, and
the red line is a rolling mean across 10 methylation sites. The labels in gray show
the number of CHM13 Iso-Seq transcripts and the gene name.


In cases such as TBC1D3, we find that most
human haplotypes vary structurally (64 to
78%). Different individuals carry different
complements and arrangements of the TBC1D3
gene family. The potential ramifications of this 
considerable expansion in humans versus 
chimpanzees and of such high structural het- 
erozygosity among humans are notable given 
this gene’s purported role in expansion of the 
frontal cortex (7). Similarly, we were able to 
reconstruct the complete structure of the LPA 
gene model in multiple human genotypes. 
Although LPA is only a single gene, variability 
in the tandemly repeated 5.2-kbp protein- 
encoding kringle IV domain underlies one of 
the most important genetic risk factors for 
cardiovascular disease. Sequence resolution of
structural variation, as well as underlying amino
acid differences, allows us to predict previously
uncharacterized risk alleles for disease
(Fig. 5). Sequence-resolved structural variation
improves genotyping and tests of selection 
(49, 82, 83), providing a path forward for un- 
derstanding the disease and evolutionary im- 
plications of these complex forms of genetic 
variation. 

Finally, the T2T-CHM13 reference—coupled 
with other long-read datasets—enables genome- 
wide functional characterization of recently 
duplicated genes. Both gene annotation and 
large-scale efforts to characterize the regu- 
latory landscape of the human genome have 
typically excluded repetitive regions, including 
the 859 human genes that map to high-identity
SDs (84, 85). This is because the
underlying short-read sequencing prevents
conventional RNA-seq or ChIP-seq data from
being assigned unambiguously to specific
duplicated genes.

In this study, we used long-read full-length 
transcript data (Iso-Seq) (86) with long-read 
methylation data from ONT sequencing of the 
same genome to simultaneously investigate 
epigenetic and transcriptional data against a 
fully assembled reference genome. The long- 
read data from the same haploid source 
facilitated the unambiguous assignment of 
these functional readouts, allowing us to cor- 
relate methylation and transcript abundance. 
Our initial analyses suggest that a large fraction 
of duplicate genes are in fact epigenetically 
silenced (characterized by hypermethylation 
of the promoter and hypomethylation of the 
gene body) and that this epigenetic mark may 
be used to predict actively transcribed loci even 
when genes are virtually identical (Fig. 6 and 
fig. S23). Although more human genomes and 
diverse tissues need to be investigated to assess 
the implications of this observation, it is clear 
that phased genome assemblies (49) with long- 
read functional readouts such as methylation 
(65), transcription, or Fiber-seq (87, 88) pro- 
vide a powerful approach to understanding 
the regulatory landscape of duplicated and 
copy number polymorphic genes in the hu- 
man genome. 
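
Because the gene-body methylation signal is read from noisy per-CpG calls, Fig. 6C smooths them with a rolling mean across 10 methylation sites. A minimal sketch of that smoothing follows; the function name and the choice to emit values only for full windows are assumptions for illustration.

```python
def rolling_mean(values, window=10):
    """Rolling mean over consecutive methylation sites.

    values: per-site methylation frequencies ordered along the gene.
    Returns one mean per full window (len(values) - window + 1 values).
    """
    if window <= 0:
        raise ValueError("window must be positive")
    means = []
    running = 0.0
    for i, value in enumerate(values):
        running += value
        if i >= window:
            running -= values[i - window]  # drop the site leaving the window
        if i >= window - 1:
            means.append(running / window)
    return means
```

A window counted in sites rather than base pairs keeps the amount of smoothing constant even where CpG density varies along the gene.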

However, there are several remaining chal- 
lenges: First, not all human haplotypes cor- 
responding to specific duplicated regions could 
be fully sequence-resolved with automated as- 
sembly of long-read HiFi sequencing technology. 
Most of the 250 unresolved regions of phased 
human genomes generated solely with HiFi 
long reads correspond to some of the largest 
and most variable duplicated regions of the 
human genome (49). For example, only 25% 
of SMN1 and SMN2 haplotypes were fully
resolved by HiFi assembly, and these un- 
resolved loci are predicted to carry some of 
the most complex structural variation patterns. 
In comparison, the T2T-CHM13 assembly used 
both accurate HiFi and ultra-long ONT data, 
and future assembly methods that combine 
these technologies will likely be critical for
diploid T2T assembly and the complete characterization
of SD haplotypes (18, 86).

Another important challenge going forward 
will be accurate representation of these more 
complex forms of human genetic variation, 
including functional annotation where linear 
representations may be insufficient. Although 
a more complex pangenome reference graph 
could overcome these limitations, it is unclear 
how this will be achieved in practice or how 
it will be adopted by genomics and clinical 
communities. This highlights the importance 
of not only the construction of a pangenome 
reference but development of necessary tools 
that will distinguish paralogous and ortholo- 
gous sequences within duplications to allow 
for comparison between haplotypes with dif- 
ferent SD architectures. The work currently 
underway by the Human Pangenome Reference 
Consortium (HPRC), Human Genome Struc- 
tural Variation Consortium (HGSVC), and 
Telomere-to-Telomere (T2T) Consortium will 
be key to developing these methods and com- 
pleting our understanding of SDs and their 
role in human genetic variation. 


Materials and methods summary 


SDs in T2T-CHM13 were identified using 
SEDEF (27) after repeat masking with Tandem 
Repeats Finder (TRF) (89) and RepeatMasker 
(90). Syntenic one-to-one alignments were 
determined using halSynteny (97). Copy num- 
ber prediction based on short-read data was 
performed with WSSD (3, 74) and mrsFAST 
(92), and regions of comparable copy num- 
ber were determined with the changepoint 
package in R (93). To generate gene annota- 
tions we used the tools Liftoff (94) and GffRead 
(95). Fosmid probes were selected from the 
ABC10 library (28, 96) and two-color FISH was 
performed to experimentally validate acrocen- 
tric SDs (97-99). All assemblies (table S12) with 
the exception of T2T-CHM13 and GRCh38 were 
assembled with hifiasm v0.12 (47) using default 
parameters. Assembly validation of TBC1D3
was performed using sample-matched ONT
data by checking the consistency of read alignments
to the assemblies (100, 101). Phylogenetic
analysis of TBC1D3 was performed with
MAFFT and RAxML (102-105). Assembling
pangenome graphs for select loci was performed 
with minigraph (67). Methylation analysis was 
performed using the methods and data described 
in Gershman et al. (106) using Winnowmap2
and Nanopolish for mapping and methyla- 
tion calling (65, 107). Data visualization and 
figures (with the exception of Miropeats) 
(108) were primarily made in R making use 
of GenomicRanges (109), Tidyverse (110),
karyoploteR (111), and circlize (112). Pipelines
used for large-scale data analysis were con- 
structed with Snakemake (113-115). Detailed 
descriptions of materials and methods are available
in the supplementary materials (25).


INTRODUCTION: To faithfully distribute genetic
material to daughter cells during cell division,
spindle fibers must couple to DNA by means
of a structure called the kinetochore, which
assembles at each chromosome’s centromere.
Human centromeres are located within large
arrays of tandemly repeated DNA sequences
known as alpha satellite (αSat), which often
span millions of base pairs on each chromosome.
Arrays of αSat are frequently surrounded
by other types of tandem satellite repeats,
which have poorly understood functions, along
with nonrepetitive sequences, including transcribed
genes. Previous genome sequencing
efforts have been unable to generate complete
assemblies of satellite-rich regions because of
their scale and repetitive nature, limiting the
ability to study their organization, variation,
and function.


RATIONALE: Pericentromeric and centromeric
(peri/centromeric) satellite DNA sequences
have remained almost entirely missing from the
assembled human reference genome for the
past 20 years. Using a complete, telomere-to-telomere
(T2T) assembly of a human genome,
we developed and deployed tailored computational
approaches to reveal the organization
and evolutionary patterns of these satellite
arrays at both large and small length scales.
We also performed experiments to map precisely
which αSat repeats interact with kinetochore
proteins. Last, we compared peri/centromeric
regions among multiple individuals
to understand how these sequences vary
across diverse genetic backgrounds.


Gapless assemblies illuminate centromere evolution. (Top) The organization of peri/centromeric satellite
repeats. (Bottom left) A schematic portraying (i) evidence for centromere evolution through layered expansions
and (ii) the localization of inner-kinetochore proteins in the youngest, most recently expanded repeats, which
coincide with a region of DNA hypomethylation. (Bottom right) An illustration of the global distribution of
chrX centromere haplotypes, showing increased diversity in populations with recent African ancestry.


RESULTS: Satellite repeats constitute 6.2% of
the T2T-CHM13 genome assembly, with αSat
representing the single largest component
(2.8% of the genome). By studying the sequence
relationships of αSat repeats in detail
across each centromere, we found genome-wide
evidence that human centromeres evolve
through “layered expansions.” Specifically, distinct
repetitive variants arise within each centromeric
region and expand through mechanisms
that resemble successive tandem duplications,
whereas older flanking sequences shrink and
diverge over time. We also revealed that the
most recently expanded repeats within each
αSat array are more likely to interact with
the inner kinetochore protein Centromere
Protein A (CENP-A), which coincides with regions
of reduced CpG methylation. This suggests
a strong relationship between local
satellite repeat expansion, kinetochore positioning,
and DNA hypomethylation. Furthermore,
we uncovered large and unexpected structural
rearrangements that affect multiple satellite
repeat types, including active centromeric αSat
arrays. Last, by comparing sequence information
from nearly 1600 individuals’ X chromosomes,
we observed that individuals with
recent African ancestry possess the greatest
genetic diversity in the region surrounding the
centromere, which sometimes contains a predominantly
African αSat sequence variant.


CONCLUSION: The genetic and epigenetic prop- 
erties of centromeres are closely interwoven 
through evolution. These findings raise impor- 
tant questions about the specific molecular 
mechanisms responsible for the relationship 
between inner kinetochore proteins, DNA hypomethylation,
and layered αSat expansions.
Even more questions remain about the function
and evolution of non-αSat repeats. To begin
answering these questions, we have produced a 
comprehensive encyclopedia of peri/centromeric 
sequences in a human genome, and we demon- 
strated how these regions can be studied with 
modern genomic tools. Our work also illumi- 
nates the rich genetic variation hidden within 
these formerly missing regions of the genome, 
which may contribute to health and disease. This 
unexplored variation underlines the need for 
more T2T human genome assemblies from genetically
diverse individuals.


Existing human genome assemblies have almost entirely excluded repetitive sequences within and near
centromeres, limiting our understanding of their organization, evolution, and functions, which include
facilitating proper chromosome segregation. Now, a complete, telomere-to-telomere human genome
assembly (T2T-CHM13) has enabled us to comprehensively characterize pericentromeric and
centromeric repeats, which constitute 6.2% of the genome (189.9 megabases). Detailed maps of
these regions revealed multimegabase structural rearrangements, including in active centromeric
repeat arrays. Analysis of centromere-associated sequences uncovered a strong relationship
between the position of the centromere and the evolution of the surrounding DNA through layered
repeat expansions. Furthermore, comparisons of chromosome X centromeres across a diverse panel
of individuals illuminated high degrees of structural, epigenetic, and sequence variation in these
complex and rapidly evolving regions.


For two decades, genome sequencing and
assembly efforts have excluded an estimated
5 to 10% of the human genome,
most of which is found in and around
each chromosome’s centromere (1, 2).
These large regions contain highly repetitive
DNA sequences, which impede assembly from
short DNA sequencing reads (1, 3). Centromeres
function to ensure proper distribution
of genetic material to daughter cells during
cell division, making them critical for genome
stability, fertility, and healthy development (4).
Nearly everything known about the sequence
composition of human centromeres and their
surrounding regions, called pericentromeres,
stems from individual experimental observations
(5-8), low-resolution classical mapping
techniques (9, 10), analyses of unassembled
sequencing reads (11-14), or recent studies of
centromeric sequences on individual chromosomes
(15-17). As a result, millions of bases in
the pericentromeric and centromeric regions
(hereafter peri/centromeres) remain largely
uncharacterized and omitted from contemporary
genetic and epigenetic studies. Recently,
long-read sequencing and assembly methods
enabled the Telomere-to-Telomere Consortium
to produce a complete assembly of an entire
human genome (T2T-CHM13) (2). This effort
relied on careful measures to correctly assemble,
polish, and validate entire peri/centromeric
repeat arrays (2, 18). By deeply characterizing
these recently assembled sequences, we present
a high-resolution, genome-wide atlas of the
sequence content and organization of human
peri/centromeric regions.

Centromeres provide a robust assembly
point for kinetochore proteins, which physically
couple each chromosome to spindle fibers
during cell division (4). Compromised centromere
function can lead to nondisjunction, a
major cause of somatic and germline disease
(19). In many eukaryotes, the centromere is
composed of tandemly repeated DNA sequences,
called satellite DNA, but these sequences
differ widely among species (20, 21).
In humans, centromeres are defined by alpha
satellite DNA (αSat), an AT-rich repeat family
composed of ~171 base pair (bp) monomers,
which can occur as different subtypes repeated
in a head-to-tail orientation for millions of
bases (22, 23). In the largest αSat arrays, different
monomer subtypes belong to higher-order
repeats (HORs); for example, monomer
subtypes a, b, and c can repeat as abc-abc-abc
(24, 25). HOR arrays tend to be large and
highly homogeneous, often containing thousands
of nearly identical HOR units. However,
kinetochore proteins associate with only a
subset of these HOR units, usually within the
largest HOR array on each chromosome, which
is called the active array (25, 26). Distinct αSat
HOR arrays tend to differ in sequence and
structure (27, 28), and like other satellite repeats,
they evolve rapidly through mechanisms
such as unequal crossover and gene conversion
(29, 30). Consequently, satellite arrays
frequently expand and contract in size and
generate a high degree of interindividual polymorphism
(29-37). Active αSat HOR arrays are
flanked by inactive pericentromeric regions,
which often include (i) smaller arrays of diverged
αSat monomers that lack HORs (27, 32);
(ii) transposable elements (TEs); (iii) segmental
duplications, which sometimes include expressed
genes (33, 34); and (iv) non-αSat satellite
repeat families (35), which have poorly
understood functions. Given the opportunity
to explore these regions in a complete human
genome assembly, we investigated the precise
localization of inner kinetochore proteins within
large αSat arrays and surveyed sequence-based
trends in the structure, function, variation,
and evolution of peri/centromeric DNA.
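
To make the higher-order repeat (HOR) idea above concrete, the sketch below finds the repeat period in a string of monomer subtype labels such as abc-abc-abc and converts a number of HOR units to an approximate array length using the ~171-bp monomer size. It assumes perfectly repeating labels, which real arrays only approximate, and the function names are hypothetical.

```python
MONOMER_BP = 171  # approximate alpha satellite monomer length from the text


def hor_period(monomers):
    """Smallest period p such that monomers[i] == monomers[i - p] for all i.

    monomers: sequence of subtype labels, e.g. "abcabcabc".
    Returns len(monomers) if no shorter exact period exists.
    """
    n = len(monomers)
    for period in range(1, n):
        if all(monomers[i] == monomers[i - period] for i in range(period, n)):
            return period
    return n


def array_length_bp(hor_units, monomers_per_hor, monomer_bp=MONOMER_BP):
    """Approximate array length for a given number of identical HOR units."""
    return hor_units * monomers_per_hor * monomer_bp
```

For example, an array of 1000 units of a three-monomer HOR would span about 513 kb under these assumptions.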


A comprehensive map of peri/centromeric 
satellite DNA 


Human peri/centromeric satellite DNAs represent
6.2% of the T2T-CHM13v1.1 genome
(~189.9 Mb) (tables S1 and S2 and figs. S1 and
S2). Nearly all of this sequence remains unassembled
or belongs to simulated arrays
called reference models (72) in the current
GRCh38/hg38 reference sequence (hereafter,
hg38), including pericentromeric satellite DNA
families that extend into each of the five acrocentric
short arms. From decades of individual
observations, a framework for the overall organization
of a typical human peri/centromeric
region has been proposed (Fig. 1A). By annotating
and examining the repeat content of these
regions in the CHM13 assembly (Fig. 1, B and
C; figs. S1 and S2; table S1; and database S1),
we tested and largely confirmed this broad
framework genome-wide at base-pair resolution.
However, we uncovered unexpected large-scale
structural rearrangements and previously
unresolved satellite variants (fig. S1).

All centromeric regions contain long tracts, 
or arrays, of tandemly repeated aSat mono- 
mers (85.2 Mb total genome-wide) (Fig. 1, B 
and C) (36). Most chromosomes also contain 
classical human satellites 2 and/or 3 (HSat2 
and HSat3, totaling 28.7 and 47.6 Mb, respec- 
tively). HSat2 and HSat3 are derived from the 
simple repeat (CATTC)n and constitute the 
largest contiguous satellite arrays found in 
the human genome, including a 27.6-Mb array 
on chromosome 9 (chr9) (Fig. 1, B and C) 
(11, 37, 38). Furthermore, two distinct satellite 
DNA families constitute the most AT-rich re- 
gions of the genome (37, 39), which we refer to 
as HSat1A (13.4 Mb total, found mostly on chr3,
chr4, and chr13) and HSat1B (found mostly on
chrY, with 1.2 Mb on the acrocentrics) (table
S1). Two additional large families, beta satellite
(βSat; 7.7 Mb total) and gamma satellite (γSat;
630 kb total), are more GC-rich than αSat and
contain dense CpG methylation (fig. S3). All
remaining annotated pericentromeric satellite 
DNAs total 5.6 Mb, with 1.2 Mb representing 
previously unresolved types of satellite DNA 
(table S1 and fig. S2) (40). Nonsatellite bases 
between satellite arrays and extending into 
the p-arms and q-arms are considered “centric 
transition” regions, which largely represent 
long tracts of segmental duplications, including 
expressed genes (Fig. 1C and fig. S1) (2, 47, 42). 
These annotations provide a complete and de- 
tailed map of all the peri/centromeric sequences 
in a human genome. 


Complete assessment of αSat substructure
and genomic organization


To better understand the organization and
evolution of αSat arrays, we generated a
genome-wide database of αSat monomers (42). We
grouped these monomers into distinct classes
belonging to 20 suprachromosomal families
(SFs) (tables S2 and S3 and database S2)
(32, 43, 44) and identified 80 different HOR
arrays and more than 1000 different monomers
in HORs across the genome (70 Mb total)
(table S4 and database S3) (38). Although 18 out
of 23 chromosomes contain multiple, distinct
HOR arrays (up to nine), only one HOR array
per chromosome is active, meaning that it
consistently associates with the kinetochore
across individuals (Fig. 1, B and C, and table
S4) (25). The active array on each chromosome
ranges in size from 4.8 Mb on chr18 down to
340 kb on chr21, which is near the low end of
the estimated αSat size range for chr21 among
healthy individuals (45). Inactive HOR arrays
tend to be much smaller (8 Mb total genome-wide)
(Fig. 1, B and C, and table S4). Adjacent to
many homogeneous HOR arrays are regions of
divergent αSat HORs, in which HOR periodicity
is somewhat or even completely eroded (44),
as well as highly divergent αSat monomeric
layers that lack HOR structure (32), totaling
15.2 Mb in CHM13.

Fig. 1. Overview of all peri/centromeric regions in CHM13. (A) Schematic of a generalized human peri/centromeric region, identifying major sequence components and their properties (not to scale). HSat2,3 repeat unit lengths vary by genomic region. (B) Barplots of the total lengths of each major satellite family genome-wide. (C) Micrographs of representative 4',6-diamidino-2-phenylindole (DAPI)-stained chromosomes from CHM13 metaphase spreads, next to a color-coded map of peri/centromeric satellite DNA arrays [available as a browser track (database S1)]. Large satellite arrays are labeled.

The completeness and quality of the T2T- 
CHM13 assembly also allowed us to resolve 
HOR arrays that are highly similar between 
chromosomes, such as those found on chro- 
mosomes 13/21, 14/22, and 1/5/19, which have 
confounded studies in the past (7, 27, 36). 
Within these arrays, we identified chromosome- 
specific sequence variants and patterns of 
structural variants, which we validated using 
flow-sorted chromosome libraries for the chro- 
mosome 1/5/19 arrays (fig. S4) (42). This enabled 
us to infer their evolutionary history, provid- 
ing evidence that the 1/5/19 HOR first origi- 
nated on chr19 (42). 


Large structural rearrangements in 
peri/centromeric regions 


Producing complete maps of peri/centromeric 
regions revealed the large-scale organization 
of satellite DNAs and their embedded non- 
satellite sequences, including TEs and genes 
(Fig. 2, A to E). Although divergent αSats contain
many inversions (46) and TE insertions
(47), such events within active HOR arrays are 
unexpected because they were considered to 
be homogeneous (48, 49). Quantifying strand 
inversions across entire satellite arrays revealed 
unexpected anomalies (Fig. 2, A, B, and E; fig.
S1; table S5; and databases S4 and S5). For
example, we uncovered a 1.7-Mb inversion
inside the active αSat HOR array on chr1
(Fig. 2A), along with inversions in inactive
HOR arrays on chr, chr16, and chr20 (figs. S1
and S5). Unexpectedly, the large pericentromeric
HSat3 array on chr9 and the βSat arrays
on chr1 and the acrocentrics contain more than
200 inversion breakpoints (Fig. 2A and fig. S5), 
whereas in other arrays inversions are rare. 
Apart from inversions, two multimegabase 
HSat1A arrays appear to have inserted in and 
split the active HOR arrays on chr3 and chr4 
(fig. S1 and table S6). We also found evidence
for an ancient duplication event that predated 
African ape divergence and involved a large 
segment of the ancient chr6 centromere 
plus about 1 Mb of adjacent p-arm sequence 
(database S6) (42). This duplication created 
a different centromere locus that hosts the 
current active chr6 HOR array. 
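
As an illustration of how strand inversions might be quantified across a satellite array, the following Python sketch counts strand switches between consecutive repeat annotations and normalizes the count by annotated length, in the spirit of the per-megabase breakpoint counts shown in Fig. 2E. The (start, end, strand) tuple layout and both function names are assumptions made for this example; the annotation procedure actually used is described in (42) and is not reproduced here.

```python
from typing import List, Tuple

# Hypothetical annotation record: (start, end, strand), strand is "+" or "-".
Annotation = Tuple[int, int, str]


def count_strand_switches(annotations: List[Annotation]) -> int:
    """Count adjacent annotation pairs whose strand differs.

    Each switch is treated as one inversion breakpoint.
    """
    ordered = sorted(annotations)  # order by start coordinate
    return sum(1 for prev, cur in zip(ordered, ordered[1:]) if prev[2] != cur[2])


def switches_per_mb(annotations: List[Annotation]) -> float:
    """Normalize the strand-switch count by total annotated length in Mb."""
    total_bp = sum(end - start for start, end, _ in annotations)
    if total_bp == 0:
        return 0.0
    return count_strand_switches(annotations) * 1e6 / total_bp


# Toy array: four 1-kb repeat blocks with one inverted block in the middle.
toy_array = [(0, 1000, "+"), (1000, 2000, "+"), (2000, 3000, "-"), (3000, 4000, "+")]
toy_switches = count_strand_switches(toy_array)
toy_density = switches_per_mb(toy_array)
```

A single inverted block contributes two breakpoints, one at each edge, so arrays with many short inversions accumulate high breakpoint densities even when little sequence is inverted.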

We also assigned HSat2 and HSat3 arrays 
to their respective sequence subfamilies from 
(11) and found previously unresolved chro- 
mosomal localizations of several HSat3 sub- 
families (such as HSat3B1 on chr17) (Fig. 2, B 
and D). However, we also noticed a lack of 
HSat3B2 on chr1, contrary to expectations
based on different cell lines (77), implying a 
large deletion of this subfamily on chr1 in 
CHM13. 

To better understand whether these satel- 
lite inversions, insertions, and deletions are 
common outside of the CHM13 genome, we 
searched for them across 16 haplotype-resolved 
draft diploid assemblies of genetically diverse 
individuals from the Human Pangenome Ref- 
erence Consortium (HPRC) (50). This revealed 
that the inversion in the active αSat HOR array
on chr1 is polymorphic across individuals and 
evident in about half of ascertainable haplo- 
types (11 of 24) (fig. S6). However, the HSat1A 
insertions on chr3 and chr4 are present in all 
ascertainable haplotypes (32 of 32 and 33 of 
33, respectively) (fig. S7). Furthermore, CHM13’s 
missing chr1 HSat3B2 array is contained with- 
in a 400-kb polymorphic deletion, which we 
detected in 29% (8 of 28) of haplotypes exam- 
ined (Fig. 2A and fig. S7). Thus, these peri/ 
centromeric structural rearrangements are not 
specific to the CHM13 genome but are present 
either variably or fixed across humans. 


TE and gene interspersion in 
peri/centromeric regions 


Like inversions and insertions, TEs are virtu- 
ally absent from homogeneous HOR arrays 
but are enriched in divergent αSat in CHM13
(Fig. 2E and database S7) (47, 57). The CHM13 
assembly also revealed regions where combi- 
nations of TE sequences have been tandemly 
duplicated, forming “composite satellites” 
[described in (40)]. We also found that other 
satellites, such as HSat1A/B, HSat3, and βSat,
often include fragments of ancient TEs as part 
of their repeating units, a phenomenon rarely 
observed in αSat HOR arrays (Fig. 2, A, B, and
E, and fig. S8) (39, 52, 53). 

We also compared our pericentromeric 
maps with gene annotations reported for T2T- 
CHM13, revealing 676 gene and pseudogene 
annotations embedded between large satellite 
arrays, including 23 protein-coding genes and
141 long noncoding RNAs (lncRNAs) (excluding
the acrocentric short arms) (table S7 and
database S8) (2). One region on chr17, located
between the large HSat3 and αSat arrays (Fig.
2C), contains two protein-coding genes: KCNJ17,
which encodes a disease-associated potassium
channel in muscle cells (54), and UBBP4, which
encodes a functional ubiquitin variant that may
play a role in regulation of nuclear lamins (55).
KCNJ17 is missing from GRCh38, which likely
has caused inaccurate and missed variant calls
in paralogous genes KCNJ12 and KCNJ18 (56).
This region also contains a lncRNA annotation
(LINC02002), which starts inside an SST1
element and continues into an adjacent 33-kb
array of divergent αSat. Furthermore, we identified
a processed paralog of an apoptosis-related
protein-coding gene, BCLAF1 (BCL2
Associated Transcription Factor 1), as part of
a segmental duplication embedded within an
inactive αSat HOR array on chr16 (fig. S9).


The fine repeat structure of satellite 
DNA arrays 


To further chart the structure of peri/centromeric 
regions at high resolution, we compared individual
repeat units within and between different
satellite arrays. We decomposed each αSat
HOR array first into individual monomers and 
then into entire HORs, revealing the positions 
of full-size canonical HORs and structural var- 
iant HORs resulting from insertions or dele- 
tions (databases S9 and S10) (42). Whereas 
some chromosomes, such as chr7, are com- 
posed almost entirely of canonical HOR units, 
others, such as chr10, contain many structural 
variant HOR types, with high variation in the 
relative frequency of these structural variants 
across individuals (Fig. 3A and fig. S10). 
Unlike αSat, some families such as HSat2
and HSat3 have inconsistent or unknown 
repeat unit lengths and often contain an ir- 
regular hierarchy of smaller repeating units. 
We refer to these repeat units as nested tan- 
dem repeats (NTRs), a more general term than 
HORs, which are composed of discrete num- 
bers of monomers of similar lengths. To ex- 
pand our ability to annotate repeat structure 
within assembled satellite DNA arrays, we 
developed NTRprism, an algorithm to dis- 
cover and visualize satellite repeat periodicity 
[(42), similar to (57)]. Using this tool, we dis- 
covered HORs in HSat1 and Sat arrays, as 
well as NTRs in multiple HSat2,3 arrays, such 
as a 2235-bp repeating unit in the HSat3B1 
array on chr17 (Fig. 3B and fig. S11). We also 
applied this tool in smaller windows across 
individual arrays, showing that repeat peri- 
odicity can vary across an array, which is 
consistent with NTRs evolving and expanding 
hyper-locally in some cases (fig. S11). 
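
To make the idea of discovering repeat periodicity concrete, here is a minimal Python sketch that tallies the spacing between successive occurrences of each k-mer and reports the most common spacing as the dominant period. It is a simplified illustration only and is not the NTRprism algorithm itself, whose details are given in (42); the function names and the default k = 6 are assumptions chosen for this example.

```python
from collections import Counter
from typing import Dict, Optional


def kmer_spacing_histogram(seq: str, k: int = 6) -> Counter:
    """Histogram of distances between consecutive occurrences of each k-mer."""
    last_seen: Dict[str, int] = {}
    spacing: Counter = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in last_seen:
            spacing[i - last_seen[kmer]] += 1
        last_seen[kmer] = i
    return spacing


def dominant_period(seq: str, k: int = 6) -> Optional[int]:
    """Return the most frequent k-mer spacing (smallest one on ties)."""
    hist = kmer_spacing_histogram(seq, k)
    if not hist:
        return None
    return max(hist.items(), key=lambda kv: (kv[1], -kv[0]))[0]


# The simple repeat (CATTC)n named in the text has a 5-bp period.
simple_period = dominant_period("CATTC" * 50)
# A made-up 10-bp unit repeated in tandem has a 10-bp period.
toy_period = dominant_period("ACGTTGCAAT" * 20)
```

Running the same function in sliding windows along an array, rather than once over the whole array, is one way to see periodicity change locally, as described for the windowed analysis above.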


Genome-wide evidence of layered expansions 
in centromeric arrays 


The T2T-CHM13 assembly also provides an
opportunity to examine how peri/centromeric
sequences evolve. A “layered expansion” model
for centromeric αSat evolution has been
hypothesized on the basis of limited observations
of the most diverged αSat sequences in
the human genome [reviewed in (36)]. This
model postulates that distinct αSat repeats
periodically emerge and expand within an
active array, displacing the older repeats
sideways and becoming the site of kinetochore
assembly. The newer, expanding αSat sequences
can originate from within the same
array (32) or from a different array (intra-
versus interarray seeding) (58, 59). As this
process iterates over time, the displaced sequences
form distinct layers that flank the active
centromere with mirror symmetry (Fig. 3C), and
these flanking layers rapidly shrink and decay
(7, 32, 44). We used the T2T-CHM13 assembly
to infer the evolutionary dynamics of αSat
repeat arrays genome-wide to test the layered
expansion model and understand how it may
relate to centromere function. In doing so, we
detected evidence of layered expansions across
all αSat sequences, from the most diverged
fringes of monomeric αSat to the cores of
active HOR arrays.

Fig. 2. Structural rearrangements, genes, and TEs in peri/centromeric regions. (A) The peri/centromeric region of chr1 (cylindrical schematic at top), zooming into the transition region between the large αSat and HSat2 arrays (tracks 1 to 4). Track 1, satellite families (color key at bottom left), with vertical placement indicating the strand with canonical satellite repeat polarity. Track 2, positions of TEs overlapping αSat or HSat1,2,3, colored by TE type. Track 3, annotated transcription
the previously unresolved HSat3B1 array indicated with an asterisk. (C) Gene annotations between the αSat and HSat3 arrays on chr17. (D) Heatmap showing the major and minor localizations of each αSat HOR SF (top; red) and each HSat2,3 subfamily (bottom; blue). “N” indicates localizations not described in (11). Dash “—” indicates the chr1 HSat3B2 array deleted in CHM13. HSat3A3 and 3A6 are predominantly found on chrY (not in CHM13). (E) Barplots illustrate the number of inversion breakpoints (strand switches) or the number and type of TEs detected per megabase within different satellite families. div, divergent αSat (dHORs + monomeric).

Fig. 3. Genome-wide evidence of layered expansions in centromeric αSat arrays. (A) (Top) HOR structural variant positions across the active αSat arrays on chr7 and chr10 (gray, canonical HORs; other colors, structural variants). (Bottom) Percentages of HOR structural variant types on HiFi sequencing reads from 16 HPRC cell lines. Variant nomenclature is described in (42); canonical HOR percentages are listed on the plot. (B) Repeat periodicities identified with NTRprism for the HSat3B1 array on chr17. (C) Comparison of the age and divergence of LINE TEs embedded in different αSat SF layers. (D) (i) Four centromeres in which an active HOR array of distinct origin appears to have expanded within a now-inactive HOR array. (ii) and (iii) Monomeric SFs (rainbow colors) surrounding active HOR arrays on eight chromosomes, with major HOR-haps shown (k = 2 to 3). Red, younger, emphasized below with red rectangles; gray, older, emphasized below with asterisks. (E) Zoomed-in view of chr3 αSat HOR arrays, divided into finer symmetrical HOR-haps (k = 7). (F) (Left) Minimum evolution tree showing the phylogenetic relationships between all HORs, colored by fine (k = 7) HOR-hap assignments. Red and gray ellipses group major HOR-hap divisions into younger and older variants, respectively (42). (Right) Phylogenetic tree built from HOR-hap consensus sequences derived from branches in the left tree, rooted with a reconstructed ancestral cen3 active HOR sequence (ANC) (42). Branch lengths indicate base substitutions per position.

First, we confirmed that two types of divergent
αSat symmetrically flank HOR arrays
across the genome: divergent HORs (dHORs)
(database S11) and monomeric αSats (table S8),
which represent ancient, decayed centromeres
of primate ancestors (32). We classified divergent
αSat into distinct SFs and dHOR families
and demonstrated how these sequences accumulate
mutations, inversions, TE insertions,
and non-αSat satellite expansions over time
(Fig. 3C; tables S5, S6, and S9; and databases
S4, S5, and S7). We also found gradients of
size and intra-array divergence (17 to 26%) in
monomeric αSat layers, a steep (~10%) divergence
increase between HORs and dHORs,
and a gradient of embedded TE quantity and
age that parallels the age of monomeric layers
(Figs. 2E and 3C, table S9, and database S7)
(17, 82, 44).

We next asked whether the layered expansion
pattern extends into the active αSat HOR
arrays. On four chromosomes, the active HOR
array is surrounded symmetrically by inactive 
HORs of a distinct type, which is consistent 
with interarray seeding [chr1 (60), chr2, chr16, 
and chr18] (Fig. 3D). In the assembled centromeres
from chrX (16, 67, 62) and chr8 (77), the
central part of the active array was found to 
contain HOR variants slightly different from 
those on the flanks. To test whether this array 
structure is typical, we aligned individual HOR 
units within the same array and clustered them 
on the basis of their shared sequence variants 
(49, 63, 64) into “HOR-haplotypes” or “HOR- 
haps” (42). Initial broad classifications of HOR 
units into two to four distinct HOR-haps per 
array revealed symmetrical layering, which 
typically expands from the middle of the array 
and is consistent with intra-array seeding and 
expansion (Fig. 3D, dark red versus gray). Fur- 
ther classification into a larger number of 
HOR-haps (5 to 10) found additional evidence
for symmetric patterns (Fig. 3E) (42). 
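
The clustering of aligned HOR units by shared sequence variants can be sketched as follows. This Python example groups equal-length, pre-aligned toy sequences into k clusters by average-linkage agglomerative clustering on Hamming distance. The choice of distance, the linkage rule, and the function names are assumptions made for illustration; the procedure actually used to define HOR-haps is described in (42).

```python
from itertools import combinations
from typing import List


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two aligned sequences."""
    return sum(x != y for x, y in zip(a, b))


def cluster_hor_units(units: List[str], k: int) -> List[int]:
    """Assign each aligned HOR unit to one of k clusters (toy HOR-haps)."""
    clusters = [[i] for i in range(len(units))]

    def avg_dist(c1: List[int], c2: List[int]) -> float:
        total = sum(hamming(units[i], units[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    # Repeatedly merge the two closest clusters until k remain.
    while len(clusters) > k:
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: avg_dist(clusters[p[0]], clusters[p[1]]),
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # b > a, so index a is unaffected

    labels = [0] * len(units)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


# Toy array listed in positional order: older units flank younger central units.
toy_units = [
    "AAAAAAAA", "AAAAAAAT",              # left flank
    "GGAAAACC", "GGAAAACC", "GGAAAACA",  # center
    "AAAAAAAA", "AAAAATAA",              # right flank
]
toy_labels = cluster_hor_units(toy_units, k=2)
```

Because the units are listed in array order, a label pattern in which one cluster occupies the center and another occupies both flanks corresponds to the symmetrical layering described above. Raising k splits the clusters further, analogous to moving from broad to fine HOR-hap classifications.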

By building rooted phylogenetic trees of 
consensus HOR-haps, we confirmed that the 
middle HOR-haps are the most recently evolved 
(Fig. 3F) (42). We also verified this using com- 
plete phylogenetic analysis of all HOR units on 
chr3, chr8, and chrX (shown for chr3 in Fig. 3F) 
(42). In addition, the intra-array divergence in 
central HOR-haps is often slightly lower than 
in the flanking arrays, indicating that the cen- 
tral HOR-haps have expanded more recently 
(Fig. 3F) (42). Together, these findings present
genome-wide evidence that active αSat
HOR arrays evolve rapidly through layered
expansions, raising the question of how this 
dynamic evolutionary process relates to the 
positioning of the centromere. 


Precise mapping of sites of 
kinetochore assembly 


Human centromeres are defined epigenet- 
ically as the specific subregion bound by inner 
kinetochore proteins within each active αSat
HOR array (21, 65). Centromeres contain a 
combination of epigenetic marks that distin- 
guish them from the surrounding pericen- 
tromeric heterochromatin. For example, the 
histone variant Centromere Protein A (CENP-A) 
is constitutively present at centromeres (66) and 
is often accompanied by “centrochromatin”- 
associated modifications to canonical histones 
(67). Active αSat arrays also have generally high
CpG methylation compared with that of neigh- 
boring inactive arrays (26) and contain local 
regions of reduced CpG methylation called 
centromere dip regions (CDRs) (16, 17, 26).
To study HOR organization at sites of kineto- 
chore assembly, we identified discrete regions 
of CENP-A enrichment within each active array 
using sequencing data from native chromatin 
immunoprecipitation (NChIP-seq) [data from 
(17)] and from CUT&RUN [data from this 
study (42)] (table S10) (68). To map these short 
sequencing reads within αSat arrays, we developed
specialized, repeat-sensitive alignment
approaches (42). 

We confirmed that CENP-A binding is almost 
exclusively localized within αSat HOR arrays,
with one active array per chromosome (tables 
S4 and S11) (25). We also found the strongest 
CENP-A enrichment near and within CDRs 
on all chromosomes (17, 26). We found that the 
complete span of each centromere position, 
defined as a window with high CENP-A en- 
richment, extends outside of the CDR and 
totals 190 to 570 kb on each chromosome 
(Fig. 4, A and B, and table S11). Each CENP-A 
span occupies 7 to 24% of the total length of 
the active HOR array in which it is embedded 
(table S11), which is contrary to predictions 
from previous work on chrX and chrY in dif- 
ferent cell lines (69). However, we cannot ex- 
clude the possibility that lower levels of CENP-A 
extend beyond these windows of strong en- 
richment, or that the sizes of these windows 
vary among cells or cell types. We detected 
smaller regions of CENP-A enrichment outside 
of the primary CDR, with some overlapping a 
minor, secondary CDR (chr4, chr16, and chr22)
or no CDR at all (chr18) (Fig. 4B, fig. S12, and
table S11). Furthermore, similar dips in CpG 
methylation, although infrequent, do occur 
outside CENP-A-associated regions, as observed
in a 5S RNA composite satellite array
(40) and within a 10-kb region in the active
αSat HOR array on chr5 (fig. S12).
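
The notion of a centromere span as a window of high CENP-A enrichment, and its expression as a fraction of the active array, can be illustrated with a short Python sketch. It assumes enrichment has already been summarized as one value per fixed-size bin and uses a simple fixed threshold; the bin size, threshold, and function names are hypothetical and are not the parameters used in this study (42).

```python
from typing import List, Tuple


def enriched_spans(values: List[float], bin_size: int,
                   threshold: float) -> List[Tuple[int, int]]:
    """Return (start, end) coordinates of runs of bins at or above threshold."""
    spans = []
    run_start = None
    for i, value in enumerate(values):
        if value >= threshold and run_start is None:
            run_start = i
        elif value < threshold and run_start is not None:
            spans.append((run_start * bin_size, i * bin_size))
            run_start = None
    if run_start is not None:  # run extends to the end of the array
        spans.append((run_start * bin_size, len(values) * bin_size))
    return spans


def fraction_of_array(spans: List[Tuple[int, int]], array_length: int) -> float:
    """Fraction of the array covered by the enriched spans."""
    return sum(end - start for start, end in spans) / array_length


# Toy 100-kb array in 10-kb bins with one major and one minor enriched region.
toy_enrichment = [0.5, 0.8, 3.0, 4.2, 3.5, 0.9, 0.4, 2.5, 0.3, 0.2]
toy_spans = enriched_spans(toy_enrichment, bin_size=10_000, threshold=2.0)
toy_fraction = fraction_of_array(toy_spans, array_length=100_000)
```

The result depends directly on the threshold, which is one reason lower levels of CENP-A outside the called windows cannot be excluded by this kind of analysis.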

We also found that CENP-A is typically en- 
riched in young, recently expanded HOR-haps 
(Fig. 4, A to D, and table S11). For example, in 
the active array on chr12, CENP-A is enriched 
on only one of two large macro-repeat struc- 
tures, both of which contain similar young 
HOR-haps (Fig. 4A and fig. S13). Further investigation
revealed that CENP-A and the
CDR coincide with a zone of very recent HOR 
expansions (eight sites of nearly identical 
duplications within a ~365-kb region) (fig. S13)
(42) that distinguish one macro-repeat region 
from the other (Fig. 4, A and D). On most other 
chromosomes, we similarly observed a pre- 
dominant zone of recently expanded young 
HOR-haps (42), which tends to associate with 
CENP-A (eight more examples are shown in 
fig. S14 and table S11). 

However, we identified a few notable ex- 
ceptions to this general trend. On chr4, which 
has two CENP-A regions occurring on either 
side of a 1.7-Mb HSat1A array, we found that 
the larger CENP-A region spans a slightly 
younger HOR-hap, and the minor CENP-A 
region spans an older HOR-hap (Fig. 4, B and
D). On chr5, chr7, and chr13, CENP-A overlaps 
young HOR-haps but not near the predominant 
zone of recent expansions on that chromosome 
(fig. S15 and table S11) (42). Inversely, CENP-A 
overlaps the zone of recent expansion on chr2, 
but this zone is composed of older HOR-haps 
(fig. S15). On chr6, we observed CENP-A enrichment
within an older HOR-hap layer, more
than a megabase away from the major zone 
of recent duplications and expansions in this 
centromere (Fig. 4, C and D). Last, chr21 shows 
enrichment across the entire active HOR array 
(the smallest in CHM13) (table S11). We ob- 
served that human centromeres and CDRs are 
typically, although not universally, positioned 
over young and/or recently expanded layers 
within active HOR arrays in CHM13, indicat- 
ing that centromere function is closely related 
to the rapid evolution of αSat sequences.


Genetic variation across human X centromeres 


Satellite DNA arrays are highly variable in size 
across individuals. The extremes of satellite 
size variation are often plainly visible under the 
microscope in chromosomal karyotypes (30), 
yet the clinical relevance of these variants re- 
mains unknown and largely unexplored. Studies 
have provided low-resolution sequencing-based 
evidence for variability in both satellite array 
lengths and in the frequency of certain sequence
and structural variants within human
populations (7-13, 29). However, satellite array 
variation and evolution have remained poorly 
understood at base-level resolution owing to a 
lack of complete centromere assemblies. 

Therefore, we characterized and compared 
centromere array assemblies from chrX across 
seven XY individuals with diverse genetic 
ancestry [lymphoblastoid cell lines from (70)] 
(Fig. 5A, fig. S16, and table S12). We assigned 
repeats in the cenX active array to seven HOR- 
haps, revealing both localized and broad var- 
iation within each array (42). For example, we 
identified duplications spanning hundreds of 
kilobases in two assemblies relative to CHM13 
(HG01109 and HG03492) (Fig. 5A and fig. S17).
Four of the seven arrays contain zones of re- 
cent HOR expansion in the younger HOR-hap 
(CHM13, HG01109, HG02145, and HG03098). 
The remaining three assemblies show a trend 
of recent expansion within older HOR-haps 
closer to the p-arm (HG03492, HG01243, and 
HG02055). We also found evidence for a re- 
cently expanded HOR-hap type (HOR-hap 6) 
present in three individuals with recent African 
ancestry but absent in the other individuals, 
including CHM13 (Fig. 5A, dark red). 

Next, we studied how this variation within 
αSats relates to variation across single-nucleotide
variants (SNVs) that tend to be co-inherited with 
the centromere. Because meiotic crossover rates 
are low in peri/centromeric regions (77), cen- 
tromeres are embedded in long haplotypes, 
called cenhaps (72). Cenhaps are identified by 
clustering pericentromeric SNVs into phylo- 
genetic trees and then splitting them into 
large clades of shared descent. We divided 
a group of 1599 XY individuals genotyped 
using published short-read sequencing data 
(73) into 12 cenhaps (with 98 individuals 
remaining unclassified) (Fig. 5B, fig. S18, and
database S12). We also used these short-read 
sequencing data to estimate the absolute size 
of each individual’s chrX active HOR array 
(fig. S19 and database S12) (72, 72), along with 
the relative proportion of that individual’s 
array belonging to each HOR-hap (42). This 
analysis revealed that distinct cenhaps have 
different αSat array size distributions and different
average HOR-hap compositions (Fig. 5B
and fig. S18). For example, HOR arrays be- 
longing to cenhaps 1 and 2 tend to be larger 
overall than those belonging to cenhaps 3 to 
12. We found a recent duplication in the chrX 
HOR array, representing hundreds of kilo- 
bases, that is common in cenhap 1 and can 
explain the relatively larger average array sizes 
in this cenhap (Fig. 5B). 

Two of the 12 cenhaps (1 and 2) are very 
common in non-African populations (49 and 
47% of individuals, respectively) and rare in 
African populations (1.7 and 3.5%, respec- 
tively) (Fig. 5C). The remaining 10 cenhaps 
are almost exclusive to African populations 
as well as those with recent African admixture 
(ASW, PUR, CLM, and ACB). The relatively 
low cenhap diversity in non-African popula- 
tions is consistent with their lower overall 
genetic diversity, which is attributable to 
demographic bottlenecks during early human 
migrations out of Africa (70). This analysis 
also revealed that HOR-hap 6 appears to be 
almost exclusively found in cenhaps 10 to 12, 
which form an anciently diverged clade within 
African populations (Fig. 5B). These findings 
demonstrate that centromere-linked SNVs can 
be used to tag and track the evolution of αSat,
and they underline the need for greater repre- 
sentation of African genomes in pan-genome 
assembly efforts. 

Last, to dissect the sequence differences 
between two arrays from the same cenhap, 
we compared two finished centromere assemblies
from CHM13 and HG002, a cell line
whose chrX array had been constructed by 
use of T2T assembly methods and whose array 
structure had been experimentally validated 
(2). We found both genomes to be highly con- 
cordant across the array, apart from three re- 
gions, where we observed recent amplifications 
and/or deletions of repeats (Fig. 5D and fig. 
S20). These comparisons of completely assembled
centromeres demonstrate that satellite
DNA variation is common at both coarse
and fine scales, raising the question of how this 
genetic variation relates to possible epigenetic 
variation in centromere positioning. 


Epigenetic variation across human 
X centromeres 


To examine how centromere positioning varies
among individuals, we compared patterns of
CENP-A CUT&RUN enrichment on the fully
assembled chrX centromeres from HG002 and
CHM13 (26). The region with the strongest
CENP-A enrichment in both arrays coincides
with the most pronounced sequence differences
between CHM13 and HG002, mostly
because of structural rearrangements (Fig.
5D, yellow, and fig. S20). Despite these local
structural differences, CENP-A remains positioned
over CDRs and young HOR-haps in
both individuals.

Last, we asked whether CENP-A enrichment
patterns were consistently found in younger
HOR-haps, as observed in CHM13 and HG002,
across seven additional cell lines with publicly
available CENP-A NChIP-seq and CUT&RUN
datasets (Fig. 5E and fig. S21). Unlike CHM13,
in three XY individuals we observed CENP-A
enrichment within the older HOR-hap subregion,
proximal to the p-arm, indicating the
presence of an epiallele [HuRef (74), HT1080b
(75), and MS4221 (76)]. This coincides with an
alternative CDR observed in the HG03098 cell
line [CDR I from (26)] (Fig. 5E). Further, we
examined two independent CUT&RUN experiments
from the RPE-1 cell line (XX) (77)
and found enrichment on both older and
younger HOR-haps, which could be explained
if the two chrX homologs carry different functional
epialleles. Three additional XX cell lines
were consistent with CHM13, providing evidence
that the same CENP-A-enriched HOR-hap
is shared across both chrX homologs in
each line (IMS13q, PDNC4, and K562) (Fig. 5E
and fig. S21) (78). These overlap a CDR also
seen in the HG01109 cell line [CDR II from
(26)] (Fig. 5E). A third CDR proximal to the
q-arm was observed in the HG01243 and
HG03492 cell lines (26), which is indicative
of a third possible CENP-A epiallele. These
findings uncover frequent variation in the
position of the chrX centromere, with some
XX individuals potentially harboring heterozygous
epialleles.

Fig. 5. Substantial genetic and epigenetic variation in and around the chrX centromere. (A) Comparing the active αSat HOR array on chrX (DXZ1) between (top) CHM13 and six HPRC cell line HiFi read assemblies. Tracks indicate HOR-haps (top, k = 7; bottom, k = 2) and recent HOR duplication events (bottom, as in Fig. 4A). (B) (Left) Phylogenetic tree illustrating the relationships of 12 cenhaps defined by using short-read data from 1599 XY genomes from (70, 73) plus HG002, CHM13, and HuRef. Triangle vertical length is proportional to the number of individuals in that cenhap (98 individuals, labeled NA and colored dark gray, belong to small clades not among the 12 major cenhaps). (Middle) Barplots illustrating the average HOR-hap compositions for all individuals within each cenhap, colored as in (A). (Right) Ridgeline plots indicating the distribution of estimated total array sizes for all individuals within each cenhap, with individual values represented as jittered points. (C) Populations represented among the 1599 XY genomes, with pie charts indicating the proportion of cenhap assignments within each population, with the same colors used as in the tree in (B). Population descriptions are in (42). (D) Comparison of the DXZ1 assembly for CHM13 and HG002, which are both in cenhap 2. Tracks are as in (A), with the addition of a top track to indicate regions that align closely (gray) or are diverged (yellow) between the two individuals. Vertical dotted line indicates the homologous site of a CHM13 expansion on the HG002 array. (Bottom) StainedGlass dotplots representing the percent identity of self-alignments within the array, with a color-key and histogram below (88). (E) A comparison of CENP-A coverage (NChIP-seq or CUT&RUN) in eight cell lines relative to the CHM13 chrX centromere assembly. Each track is normalized to its maximum peak height in the array. Below are CDR positions from (26).


Discussion 


This study provides comprehensive maps of 
recently assembled human peri/centromeric 
regions to facilitate exploration of their func- 
tion, variation, and evolution. Using this re- 
source, we uncovered strong evidence that 
the genetic and epigenetic fates of centromeres 
are intertwined through evolution: αSat arrays
evolve through layered expansions, and the 
inner-kinetochore protein CENP-A tends to 
associate with the most recently expanded 
sequences. The kinetochore frequently shifts 
to new loci, and the old loci rapidly shrink 
and decay. 

One possible explanation for this relationship
is that αSat expansions occur independently
of the kinetochore, but the kinetochore
maintains an affinity for some property of
recently expanded sequences, such as their 
homogeneity (the “independent expansion 
hypothesis”). Kinetochore-independent expan- 
sion is feasible in light of our observation of 
large duplications and localized repeat ex- 
pansions in noncentromeric satellites such as 
HSat3 arrays, which are not associated with 
kinetochores (fig. S11). Another possibility is 
that kinetochore proteins—or other proteins 
that may associate with the centromere such 
as loading, replication, recombination, or re- 
pair factors—play a causal role in the expansion 
of particular HOR variants [the “kinetochore 
selection hypothesis” (36)]. This aligns with 
the proposed recombination-based homoge- 
nization process in Arabidopsis (79). Further, 
experiments in model organisms have demon- 
strated that extreme array sequence variants 
increase meiotic and mitotic nondisjunction 
rates and can promote both mutational drive 
and/or female meiotic drive (20, 80-82). Sim- 
ilar drive mechanisms (83), along with selec- 
tion for variants that promote high-fidelity 
chromosome transmission, may also play a 
role in shaping centromeric sequence evolu- 
tion in humans. Exploring these evolutionary 
models, as well as studying why CENP-A co- 
localizes with CDRs, will require precise exper- 
imental methods for measuring interactions 
between kinetochore proteins and repetitive 
DNA [such as DiMeLo-seq (84)]. 

Fully assembled peri/centromeric regions 
also provide a reference against which se- 
quencing information from multiple individ- 
uals can be aligned and compared. By doing 
so, we uncovered a 400-kb polymorphic deletion 
of an entire HSat3 array and a 1.7-Mb 
polymorphic inversion in an active αSat HOR 
array, both on chr1. We also detected an expansion 
of a particular αSat sequence variant 
on chrX in individuals with recent African 
ancestry. This high degree of satellite DNA 
polymorphism underlines the need to pro- 
duce T2T assemblies from genetically diverse 
individuals, to fully capture the extent of hu- 
man variation in these regions, and to shed 
light on their recent evolution. Measuring 
this variation will also be essential to under- 
stand the functional consequences of satellite 
variation on centromere function or, in the 
case of HSat3, on phenomena such as sat- 
ellite transcription in response to stress [re- 
viewed in (38)]. 

Along with genetic variation, we identified 
epigenetic variation in the location of CENP-A 
within the αSat array on chrX, similar to a rare 
but well-studied epiallele on chr17 (85-87). 
CENP-A is typically positioned on young 
HOR-haps on chrX, as seen for most chro- 
mosomes in CHM13. However, in some cell 
lines, CENP-A appears to be positioned over 
older chrX HOR-haps more than a megabase 
away (Fig. 5E), which is similar to the positioning 
of the chr6 CENP-A locus in CHM13. 
Thus, although CENP-A tends to localize to 
the most recently expanded HORs, there are 
exceptions on at least some chromosomes in 
some individuals. Studying centromere posi- 
tioning across many samples, across families, 
and across different tissues from the same indi- 
viduals will reveal the extent of this epigenetic 
plasticity in centromere localization and how 
this epigenetic variation relates to genetic var- 
iation and evolution. This will potentially illu- 
minate how human cells maintain essential 
centromere functions despite the rapid evolu- 
tion of centromeric DNA and inner-kinetochore 
proteins, an anomaly referred to as the “cen- 
tromere paradox” (20). 


Materials and methods 


A very brief methods overview is provided 
here. Detailed methods are provided in (42). 
Repeats in the T2T-CHM13 assembly were 
annotated by parsing and combining output 
from RepeatMasker [provided in (40)] along 
with custom-built pipelines for annotating 
αSat and HSat2,3 (42). Regions identified as 
“SAR” by RepeatMasker were annotated as 
HSat1A, and regions annotated as “HSATI” 
by RepeatMasker were annotated as HSat1B. 
αSat HOR-haps were identified by (i) generating 
multiple alignments of all HOR units 
(or subregions of HOR units) from an array, 
(ii) deriving a consensus sequence, (iii) recoding 
the individual sequences into binary vectors 
based on matches to the consensus, and 
(iv) clustering these binary vectors by use of 
k-means clustering. Phylogenetic analyses of 
αSat sequences were performed with MEGA5. 
Dotplots colored by percent identity were 
produced with StainedGlass (88). 
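
The four HOR-hap steps listed above can be illustrated with a minimal Python sketch. It assumes the HOR units are already aligned to equal length, uses a simple majority-vote consensus, and runs a plain Lloyd's k-means written out by hand; the function names and the toy sequences are hypothetical and are not taken from the pipeline described in (42).

```python
import random
from collections import Counter


def consensus(aligned_units):
    """Majority base at each column of equal-length, pre-aligned HOR units."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*aligned_units))


def to_binary(unit, cons):
    """Recode a unit as 1 (matches consensus) or 0 (differs) per column."""
    return [1 if a == b else 0 for a, b in zip(unit, cons)]


def sqdist(u, v):
    """Squared Euclidean distance between two numeric vectors."""
    return sum((x - y) ** 2 for x, y in zip(u, v))


def kmeans(vectors, k, n_iter=20, seed=0):
    """Plain Lloyd's k-means; requires at least k distinct input vectors."""
    rng = random.Random(seed)
    # Start from distinct vectors so that no two initial centers coincide.
    distinct = sorted(set(map(tuple, vectors)))
    centers = [list(c) for c in rng.sample(distinct, k)]
    labels = [0] * len(vectors)
    for _ in range(n_iter):
        # Assignment step: nearest center for every vector.
        labels = [min(range(k), key=lambda c: sqdist(v, centers[c])) for v in vectors]
        # Update step: each center becomes the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Toy array: four units of one variant, three of another (hypothetical data).
units = ["ACGTACGT"] * 4 + ["ACTTACGA"] * 3
cons = consensus(units)
vectors = [to_binary(u, cons) for u in units]
hap_labels = kmeans(vectors, k=2)
```

In this toy case the two variants differ from each other at two columns, so the binary vectors fall into two groups and each group receives its own cluster label; which group is labeled 0 or 1 is arbitrary.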

To analyze short-read NChIP-seq and 
CUT&RUN data, two parallel methods were 
developed: (i) marker-assisted mapping to 
the T2T-CHM13 reference and (ii) reference- 
free region-specific marker enrichment. For 
marker-assisted mapping, reads were aligned 
to the reference then filtered to include only 
alignments that overlap precomputed nucleotide 
oligomers of length k (k-mers) that occur 
in only one distinct position in the reference. 
For reference-free enrichment analysis, a 
set of k-mers that are enriched in CENP-A- 
targeted sequencing reads (relative to reads 
from input or immunoglobulin G controls) 
were first identified. Next, these enriched 
k-mers were compared with precomputed 
k-mers in the reference that occur exclusively 
within a single window of a given size (“region- 
specific markers”). Windows with multiple 
matches to enriched k-mers were reported 
as enriched for CENP-A. We performed a sim- 
ilar analysis using HOR-hap-specific markers 
on chrX, to reveal the broad enrichment of 
CENP-A on each HOR-hap across multiple 
individuals (fig. S21). 
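
As a rough illustration of the two marker strategies described above, the following Python sketch builds unique k-mer markers and region-specific markers from a toy reference. It is a simplification under stated assumptions: only the forward strand is considered, "overlap" is taken to mean that an alignment fully contains a marker, each k-mer is assigned to a window by its start coordinate, and all function names and the `min_hits` threshold are hypothetical rather than taken from (42).

```python
from collections import defaultdict


def kmer_positions(ref, k):
    """Map every k-mer in the reference to the list of its start positions."""
    positions = defaultdict(list)
    for i in range(len(ref) - k + 1):
        positions[ref[i:i + k]].append(i)
    return positions


def unique_markers(ref, k):
    """k-mers that occur at exactly one position, mapped to that position."""
    return {kmer: p[0] for kmer, p in kmer_positions(ref, k).items() if len(p) == 1}


def alignment_has_marker(start, end, marker_starts, k):
    """True if the half-open alignment [start, end) fully contains a unique k-mer."""
    return any(start <= m and m + k <= end for m in marker_starts)


def region_specific_markers(ref, k, window):
    """k-mers whose occurrences all start within one window, mapped to its index."""
    markers = {}
    for kmer, p in kmer_positions(ref, k).items():
        windows = {i // window for i in p}
        if len(windows) == 1:
            markers[kmer] = windows.pop()
    return markers


def enriched_windows(enriched_kmers, markers, min_hits=2):
    """Windows matched by at least `min_hits` enriched k-mers (assumed threshold)."""
    hits = defaultdict(int)
    for kmer in enriched_kmers:
        if kmer in markers:
            hits[markers[kmer]] += 1
    return sorted(w for w, n in hits.items() if n >= min_hits)


# Toy reference (hypothetical): a unique core flanked by a repeated A-run.
ref = "AAAAACGTAAAAA"
uniq = unique_markers(ref, 3)
keep = alignment_has_marker(2, 8, uniq.values(), 3)   # spans the unique core
drop = alignment_has_marker(0, 4, uniq.values(), 3)   # lies entirely in the repeat
regional = region_specific_markers(ref, 3, window=5)
called = enriched_windows(["CGT", "GTA", "AAA"], regional)
```

The repeated k-mer `AAA` occurs in several windows and therefore contributes to neither marker set, which is the property that lets short reads be placed confidently in otherwise repetitive arrays.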

INTRODUCTION: Transposable elements (TEs), 
repeat expansions, and repeat-mediated struc- 
tural rearrangements play key roles in chromo- 
some structure and species evolution, contribute 
to human genetic variation, and substantially 
influence human health through copy number 
variants, structural variants, insertions, dele- 
tions, and alterations to gene transcription and 
splicing. Despite their formative role in genome 
stability, repetitive regions have been relegated 
to gaps and collapsed regions in human ge- 
nome reference GRCh38 owing to the techno- 
logical limitations during its development. The 
lack of linear sequence in these regions, par- 
ticularly in centromeres, resulted in the in- 
ability to fully explore the repeat content of 
the human genome in the context of both local 
and regional chromosomal environments. 


RATIONALE: Long-read sequencing supported 
the complete, telomere-to-telomere (T2T) as- 
sembly of the pseudo-haploid human cell line 
CHM13. This resource affords a genome-scale 
assessment of all human repetitive sequences, 
including TEs and previously unknown re- 
peats and satellites, both within and outside 


of gaps and collapsed regions. Additionally, 
a complete genome enables the opportunity 
to explore the epigenetic and transcriptional 
profiles of these elements that are fundamen- 
tal to our understanding of chromosome struc- 
ture, function, and evolution. Comparative 
analyses reveal modes of repeat divergence, 
evolution, and expansion or contraction with 
locus-level resolution. 


Telomere-to-telomere assembly of CHM13 supports repeat annotations and discoveries. The human 
reference T2T-CHM13 filled gaps and corrected collapsed regions (triangles) in GRCh38. Combining long read-based 
methylation calls, PRO-seq, and multilevel computational methods, we provide a compendium of human repeats, define 
retroelement expression and methylation profiles, and delineate locus-specific sites of nascent transcription genome-wide, 
including previously inaccessible centromeres. SINE, short interspersed element; SVA, SINE-variable number tandem repeat-Alu; 
LINE, long interspersed element; LTR, long terminal repeat; TSS, transcription start site; pA, polyadenylation signal. 


RESULTS: We implemented a comprehensive 
repeat annotation workflow using previously 
known human repeats and de novo repeat 
modeling followed by manual curation, including 
assessing overlaps with gene annotations, 
segmental duplications, tandem repeats, 
and annotated repeats. Using this method, we 
developed an updated catalog of human repetitive 
sequences and refined previous repeat 
annotations. We discovered 43 previously unknown 
repeats and repeat variants and characterized 
19 complex, composite repetitive 
structures, which often carry genes, across 
T2T-CHM13. Using precision nuclear run-on 
sequencing (PRO-seq) and CpG methylated 
sites generated from Oxford Nanopore Technologies 
long-read sequencing data, we assessed 
RNA polymerase engagement across retro- 
elements genome-wide, revealing correlations 
between nascent transcription, sequence diver- 
gence, CpG density, and methylation. These 
analyses were extended to evaluate RNA poly- 
merase occupancy for all repeats, including 
high-density satellite repeats that reside in 
previously inaccessible centromeric regions of 
all human chromosomes. Moreover, using both 
mapping-dependent and mapping-independent 
approaches across early developmental stages 
and a complete cell cycle time series, we found 
that engaged RNA polymerase across satellites 
is low; in contrast, TE transcription is abun- 
dant and serves as a boundary for changes in 
CpG methylation and centromere substructure. 
Together, these data reveal the dynamic rela- 
tionship between transcriptionally active retro- 
element subclasses and DNA methylation, as 
well as potential mechanisms for the deriva- 
tion and evolution of new repeat families and 
composite elements. Focusing on the emerging 
T2T-level assembly of the HG002 X chromosome, 
we reveal that a high level of repeat variation 
likely exists across the human population, 
including composite element copy numbers 
that affect gene copy number. Additionally, we 
highlight the impact of repeats on the struc- 
tural diversity of the genome, revealing repeat 
expansions with extreme copy number differ- 
ences between humans and primates while 
also providing high-confidence annotations 
of retroelement transduction events. 


CONCLUSION: The comprehensive repeat anno- 
tations and updated repeat models described 
herein serve as a resource for expanding the 
compendium of human genome sequences and 
reveal the impact of specific repeats on the 
human genome. In developing this resource, 
we provide a methodological framework for 
assessing repeat variation within and between 
human genomes. The exhaustive assessment of 
the transcriptional landscape of repeats, at both 
the genome scale and locally, such as within 
centromeres, sets the stage for functional studies 
to disentangle the role transcription plays in the 
mechanisms essential for genome stability and 
chromosome segregation. Finally, our work 
demonstrates the need to increase efforts toward 
achieving T2T-level assemblies for nonhuman 
primates and other species to fully understand 
the complexity and impact of repeat-derived 
genomic innovations that define primate lin- 
eages, including humans. 
Mobile elements and repetitive genomic regions are sources of lineage-specific genomic innovation and 
uniquely fingerprint individual genomes. Comprehensive analyses of such repeat elements, including 
those found in more complex regions of the genome, require a complete, linear genome assembly. We 
present a de novo repeat discovery and annotation of the T2T-CHM13 human reference genome. We 
identified previously unknown satellite arrays, expanded the catalog of variants and families for repeats 
and mobile elements, characterized classes of complex composite repeats, and located retroelement 
transduction events. We detected nascent transcription and delineated CpG methylation profiles to 
define the structure of transcriptionally active retroelements in humans, including those in centromeres. 
These data expand our insight into the diversity, distribution, and evolution of repetitive regions that 


have shaped the human genome. 


Studies of mobile elements and repeat 
arrays have long shown that eukaryotic 
genomes are in constant flux (1). 
Transposable element (TE) insertions 
and repeat-mediated structural rearrangements 
can influence gene regulation, 
create new coding structure, and affect chromosome 
stability. Transposition, expansion, and 
contraction of repeats generate species-specific 
genomic innovations (1, 2), major evolutionary 
transitions (3), and human- and primate-specific 
adaptations (4). Together, TEs and other 
forms of repetitive DNA, constituting more than 
half of the human genome, are the largest contributor 
to human genetic variation and affect 
human health (5) owing to their roles in dele- 
terious copy number variants (CNVs), struc- 
tural variants (SVs), insertions, deletions, and 
alterations to gene transcription and splicing. 

A major challenge in tracking and under- 
standing repeat structure, function, and vari- 
ation is that large complex repeats, sequences 
in tandem arrays, and recent insertions by 
TEs have been largely impenetrable to avail- 
able sequencing and assembly technologies. 
Despite this challenge, a species-agnostic re- 
peat database (the Dfam database) (6), manual 
curation (7), and the development of improved 
algorithms for repeat discovery (8, 9) have laid 
the groundwork underlying efforts to create 
and finish a complete map and catalog of the 
repertoire of human repeats. 

Previous assemblies of a reference human 
genome contained gaps and collapsed repeats 
(10). Capitalizing on recent advances 
in ultralong sequencing and assembly methods, 
the Telomere-to-Telomere (T2T) Consortium 
generated a complete human reference 
genome on the basis of the pseudo-haploid 
genome of an androgenetic hydatidiform mole 
(CHM13hTERT cell line, hereafter CHM13) (11). 
This assembly, T2T-CHM13v1.1, resulted in the 
addition of more than 200 mega-base pairs 
(Mbp) of DNA and resolution of collapsed and 
unassembled regions in previous reference 
genomes. The gap-filled and decompressed 
regions, representing 8% of the human ge- 
nome, are dominated by tandemly arrayed 
repeats [such as in the alpha satellite arrays 
that are found in higher-order repeat arrays 
(HORs) within centromeres (12)] and complex 
repeats in pericentromeres, subtelomeres, and 
some chromosome arms (i.e., acrocentrics). 
T2T-CHM13 supported additional annota- 
tions for human repetitive sequences resid- 
ing in previously unassembled regions of the 
human reference GRCh38 and added repeat 
annotations for low copy repeats genome-wide. 
In total, we identify 53.9% of the T2T-CHM13 
assembly as repetitive. Here we highlight key 
advances from this resource, while illustrating 
the power of combining multiple approaches 
and tools to enhance genomic discoveries. 
Eukaryotic repeats are classified into two 
main types on the basis of their genomic or- 
ganization: tandem repeats and interspersed 
repeats (Fig. 1A) (6, 13). Tandem repeats are 
further subdivided into satellites and simple 
repeats; satellites are often further defined 
by their regional chromosomal distribution 
(centromeric, for example). With the excep- 
tion of pseudogenes retroposed from struc- 
tural RNAs (tRNAs, ribosomal RNAs, etc.), 
interspersed repeats largely refer to TEs, which 
are classified on the basis of their mechanism 
of propagation (6, 14-16). Class I elements are 
spread within genomes via retrotransposition 
and are further subdivided into two broad 
subclasses. One subclass consists of long inter- 
spersed elements (LINEs) and long terminal 
repeat (LTR) elements, which typically encode 
their own catalyzing enzymes. The other con- 
sists of short interspersed elements (SINEs) 
and the composite retroelement SINE-VNTR- 
Alus (SVAs), both of which are nonautonomous, 
relying on LINE-encoded proteins for retro- 
transposition. Class II elements are those that 
are mobilized through transposase, helicase, 
or recombinase and include TEs such as Tc1-Mariner 
and hAT. These varied repeat types 
constitute a major portion, and in some cases 
the majority [for example, 85% of wheat ge- 
nomes (17)], of eukaryotic genome sequences. 
The varied modes of propagation of such 
repeats, from simple insertion events to pro- 
moting nonallelic recombination, facilitate 
genomic diversity, often in bursts of activity 
followed by periods of neutral evolution. Fur- 
thermore, organismal defense mechanisms 
that have evolved to counter the deleterious 
effects of mobilization, such as DNA methyl- 
ation, can influence the sequence evolution 
of targeted elements. Repeats represent the 
nexus of evolutionary forces, the selfishness 
of mobile elements, and the cellular mecha- 
nisms marshaled to silence them. The genomic 
turbulence engendered by repeats makes them 
the most challenging genomic regions to study. 
However, insights from studies of these regions 
have revealed regulatory and coding domains 
critical to organismal life histories and human 
health. A full accounting of repeat domains 
permitted by a gapless telomere-to-telomere 
DNA sequence is therefore essential to a full 
understanding of the origins and function of 
the human genome. 


Results 
Comprehensive repeat annotations for a 
complete human genome 


We developed a computational pipeline to dis- 
cover previously unknown repeat annotations 
and tandem arrays while reducing false positives 
from pseudogenes, segmental duplications, 
and Dfam overlaps (18) (fig. S1). At each 
step, computational analysis was supplemented 
by manual curation and polishing. In total, 
49 previously unidentified repeat types from 
RepeatModeler were curated, including 27 re- 
peats (Fig. 1B and fig. S2) as well as 22 poten- 
tially older TE repeats whose alignment scores 
precluded classification and were thus set aside 
(table S1). Among the 27 identified repeats were 
one previously unknown centromeric satellite 
[86.6% of base pairs found within centromere 
regions defined in (11, 12)] and 10 repeats classified 
into five variants of known satellites 
[three centromere transition satellites (GSATII, 
HSAT5v1, and HSAT5v2) and two interstitial sat- 
ellites (SATR1 and SST1)] and five previously 
unknown repetitive sequences. Manual cura- 
tion identified an additional 13 interstitial sat- 
ellite arrays (and monomers of the satellites); 
three repetitive sequences, all of which were 
previously unknown and unclassified; and 
19 composite elements (including 16 curated 
composite subunits), defined as a repeating 
unit consisting of three or more repeated se- 
quences, including TEs, simple repeats, com- 
posite subunits, and satellites (fig. S2). In total, 
62 repeat entries were classified and submitted 
to Dfam as previously unannotated human re- 
peats, with 19 elements added as a “composite” 
track for the T2T-CHM13v1.1 genome browser 
(Table 1 and table S2). 

This updated repeat library yielded annota- 
tions of human repeats within regions previ- 
ously unresolved in GRCh38 and provided copy 
number support to identify additional, previously 
unnoticed repeat elements genome-wide 
(Fig. 1B and fig. S2). Using this T2T-CHM13-based 
repeat library, the T2T-CHM13v1.1 assembly 
was fully annotated for all repeat classes, 
resulting in 1.65 giga-base pairs (Gbp) of repeat 
annotations (53.94% of the genome), of which 
168.3 Mbp are found within the 182.1 Mbp 
of gap-filled T2T-CHM13 genomic sequence 
(92.4%), representing added annotations, and 
5.5 Mbp of which are previously unknown hu- 
man repeats that we identified genome-wide 
(Table 1; tables S2 and S3; and fig. S3). Re- 
annotation of GRCh38 (without the Y chromo- 
some) using the T2T-CHM13 repeat database 
resulted in annotation of 2,114,766 bp of previously 
uncataloged repeats (Table 2 and table 
S3), demonstrating the utility of a T2T-level 
assembly in supporting more comprehensive 
repeat annotations. Additionally, reannotation 
of the GRCh38 Y chromosome revealed pre- 
viously unidentified annotations consisting of 
six composite elements, eight satellite arrays, 
156 satellite variants, and six unclassified re- 
peats, totaling 161,055 bp in repeat annota- 
tions discovered through this study (fig. S4 
and table S4). 
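
The proportions quoted in this paragraph can be checked with a few lines of arithmetic. The sketch below uses only the figures stated in the text; the implied assembly size is a back-calculation from the rounded values and is shown for orientation, not as a reported result.

```python
# Figures quoted in the text above.
gap_filled_mbp = 182.1            # gap-filled T2T-CHM13 sequence
annotated_in_gaps_mbp = 168.3     # repeat annotations inside that sequence
total_repeat_gbp = 1.65           # all repeat annotations
repeat_fraction = 0.5394          # 53.94% of the genome

# Share of gap-filled sequence that is annotated as repeat.
gap_repeat_percent = 100 * annotated_in_gaps_mbp / gap_filled_mbp

# Assembly size implied by the rounded totals (a back-calculation only).
implied_assembly_gbp = total_repeat_gbp / repeat_fraction

print(f"{gap_repeat_percent:.1f}% of gap-filled sequence is repeat")
print(f"implied assembly size: about {implied_assembly_gbp:.2f} Gbp")
```

The first value reproduces the 92.4% given in the text, which confirms that the gap-filled sequence is almost entirely repetitive.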

The reannotated GRCh38 and annotated 
T2T-CHM13 were compared with reverse 
liftOver coordinates (CHM13 to GRCh38) to 
identify TE insertions specific to CHM13 (18). 
TEs found in CHM13 but not in GRCh38 were 
further grouped into those that are in gap-filled 
regions (nonsyntenic overlap) or those that 
are potentially polymorphic between these 
two genomes or were collapsed in the GRCh38 
assembly (syntenic overlap but missing in 
GRCh38) (Fig. 1C). 
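
The grouping described above can be sketched as a small interval classifier. This is an illustration under simplifying assumptions, not the procedure in (18): coordinates are half-open intervals on a single CHM13 chromosome, the liftOver outcome is supplied as a boolean, and the function names and category labels are hypothetical.

```python
from bisect import bisect_right


def build_index(intervals):
    """Sort and merge half-open (start, end) intervals into parallel lists."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [m[0] for m in merged], [m[1] for m in merged]


def overlaps(index, start, end):
    """True if [start, end) overlaps any interval in the merged index."""
    starts, ends = index
    # Last merged interval that begins before the query ends.
    i = bisect_right(starts, end - 1) - 1
    return i >= 0 and ends[i] > start


def classify_te(te, lifted, gap_index):
    """Assign a CHM13 TE to one of the three groups described in the text."""
    if lifted:
        return "shared with GRCh38"
    if overlaps(gap_index, te[0], te[1]):
        return "gap-filled region (nonsyntenic)"
    return "syntenic but missing in GRCh38"


# Hypothetical gap-filled intervals on one chromosome.
gap_index = build_index([(100, 200), (150, 300), (500, 600)])
example = classify_te((250, 260), lifted=False, gap_index=gap_index)
```

Merging the gap intervals first keeps each lookup to a single binary search, which matters when millions of TE annotations are classified.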

Across 4,531,994 TEs with lifted coordinates 
(i.e., shared between T2T-CHM13 and GRCh38), 
118,787 lifted TE pair annotations were dis- 
cordant between the two genomes (fig. S5A); 
82.3% of these (97,719 discordant liftOver pairs) 
were typically short loci with low scores and 
therefore of questionable discordance (19), 
and/or subtle subfamily reclassifications (fig. 
S5B and table S5). Among the 20,427 unlifted 
TEs specific to T2T-CHM13, all TE classes are 
represented (Fig. 1C), with 35.2% of TE sites 
specific to gap-filled regions in T2T-CHM13 
(7194 total TEs) (tables S5 and S6). Unlifted 
TE sites are found genome-wide, with a higher 
density on the acrocentric chromosomes 13, 
14, 15, 21, and 22 (fig. S5C and table S7). 


Table 1. Complete genome assembly supported discovery and refinement of human repeat 
annotations. Repeats identified through RepeatModeler and manual curation (RMv2) shown in 
counts and base pairs, by category, for T2T-CHM13v1.1 and GRCh38 (excluding the Y chromosome) 
(Fig. 1B and table S2). 

Repeat category       CHM13v1.1 RMv2    GRCh38 (excluding Y) RMv2 
                      Bp                Count     Bp 
Composite subunits    2,805,296         1,162     536,979 

Remaining values (row and column labels lost in extraction): 1,025,084; 1,127,758; 5,926,979; 2,114,766 


Composite elements shape the human genome 
and local methylation 


Composite structural elements contribute to 
human diversity and disease through struc- 
tural variation and copy number variation, par- 
ticularly when exonic regions are “captured” 
in a core unit (20). We annotated 19 composite 
repeat elements (table S2 and figs. S6 to S11) in 
T2T-CHM13, each composed of three or more 
repeated sequences, including TEs, simple repeats, 
composite subunits, and satellites (18). 
Most composites are found in a tandem array 
only on a single chromosome (figs. S6, A to F, 
and S7, B to G), and in eight cases, each core 
unit contains protein-coding annotations (fig. 
S7), indicating that unequal crossing-over events 
and concerted evolution among composite units 
contribute to the expansion or contraction of 
gene families within humans (table S2). 
One composite, 5SRNA_Comp, consists of a 
portion of the 5S RNA, an AluY, and two subunit 
repeats as an array of 128 repeating units 
with high sequence similarity (most share 98 
to 100% identity) on chromosome 1 (fig. S9, A 
and B). Using methylation profiles developed for 
T2T-CHM13 and long read-based methylation 
clusters (21), we find that the methylation pattern 
of the 5SRNA_Comp is not consistent 
across the array; rather we find a drop in 
methylation, which we called a methylation 
dip region (MDR), internal to the array, similar 
to the centromere dip region (CDR) identified 
in higher-order arrays of alpha satellites 
in T2T-CHM13 (21) (Fig. 1D). The location of the 
MDR is not linked to DNA sequence, as neither 
the GC content nor sequence identity is var- 
iable across repeat units in this array (Fig. 1D 
and fig. S9B), suggesting that other epigenetic 
factors may facilitate the drop in methylation. 
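
To make the idea of a methylation dip concrete, the sketch below scans per-unit methylation fractions with a sliding window and reports the lowest window if it falls well below the array-wide mean. The text does not state how the MDR was called, so the window size, the `drop` threshold, and the function name are all assumptions made for illustration.

```python
def find_dip(meth, window=5, drop=0.3):
    """Return (start, end) of the lowest-methylation window, or None.

    `meth` holds one methylation fraction (0 to 1) per repeat unit. A dip is
    reported only when the window mean is at least `drop` below the mean of
    the whole array. Both parameters are illustrative assumptions.
    """
    if len(meth) < window:
        return None
    overall = sum(meth) / len(meth)
    best_start, best_mean = 0, float("inf")
    running = sum(meth[:window])
    for start in range(len(meth) - window + 1):
        if start > 0:
            # Slide the window one unit to the right.
            running += meth[start + window - 1] - meth[start - 1]
        mean = running / window
        if mean < best_mean:
            best_start, best_mean = start, mean
    if overall - best_mean >= drop:
        return best_start, best_start + window
    return None


# Hypothetical array: high methylation with an internal hypomethylated block.
array_meth = [0.9] * 10 + [0.2] * 5 + [0.9] * 10
dip = find_dip(array_meth)
```

An array with uniform methylation returns `None`, whereas the toy array above yields the coordinates of its internal low-methylation block.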
We annotated a highly complex composite, 
TELO_Comp, that consists of multiple satellite 
arrays and other composites (Fig. 1E), with 
instances found on 10 chromosomes (figs. S12 
and S13) at interstitial, pericentromeric, and 
subtelomeric loci. The canonical TELO_Comp 
consists of three 3-kbp (kilo-base pair) com- 
posites (TELO-A, -B, and -C subunits), each 
containing multiple TEs, downstream of a 
variable-length array of a 49-bp satellite repeat 
unit, ajax, bounded by a duplicated sequence, 
teucer (Fig. 1E). In-depth analysis of the over- 
all structure of the subunits across all loci and 
phylogenetic analyses of the TELO-A subunit 
(18) (fig. S12 and table S8) indicate that subtelomeric 
units are a monophyletic group of 
recent origin, likely by segmental duplication 
events (fig. S13A and tables S8 and S9), whereas 
interstitial and pericentromeric units are 
polyphyletic. Moreover, each subtelomeric unit 
contains the ajax array proximal to the telomere, 
indicating that inverted orientations are 
favored at subtelomeric loci. Location-specific 
repeat diversification in subunit content and 
structure as well as ajax and teucer repeat copy 
numbers, which each retain high sequence 
identity (figs. S13, B and C, S14, and S15, and 
tables S10 and S11), reveal differential evolutionary 
forces acting on TELO_Comp loci 
on the basis of chromosome location. 


Table 2. Repeat annotations are more refined for CHM13v1.1. Kilo-base pairs of repeat annotations, by repeat class and family, for different human 
genome assemblies with T2T-CHM13v1.1 RMv2, GRCh38 RMv2, and GRCh38 Dfam3.3 only. Note that AluJb is included in the Alu repeat family category. 

Column groups: CHM13v1.1 RMv2 (Kbp, % of assembly); GRCh38 (excluding Y) RMv2 (Kbp, % of assembly); GRCh38 (excluding Y) Dfam3.3 (Kbp, % of assembly). 
Only part of the table survives; values are listed in extraction order and their column assignment is not recoverable. 

Repeat class   Repeat family   Surviving values 
SINE           5S-Deu-L2       0.0072; 0.0075 
SINE           tRNA-RTE        549.0; 0.0187; 525.9; 0.0180 
LINE           CR1             10,817.9; 0.3698; 10,571.1; 0.3618 
LINE           L1              512,421.5; 17.3633; 507,866.7; [illegible] 
LINE           Penelope        68.0; 0.0023; 63.1; 0.0022 
LTR            ERVL            59,049.8; 2.0082; 58,646.0; 2.0069 
(label lost)   (label lost)    0.1087; 3,081.7; 0.1055 
(label lost)   (label lost)    0.0014; 40.0; 0.0014 
(label lost)   (label lost)    0.0016; 45.2; 0.0015 
DNA            TcMar-Tc1       135.8; 0.0046; 134.1; 0.0046 
DNA            hAT-Tip100      12,250.5; 0.4010; 12,034.5; 0.4118; 1,915.0; 0.4077 
N                              (values lost) 
% repeatmasked                 (values lost) 

Meta-analysis of aggregated methylation frequency 
across the TELO_Comp units (±20 kbp) 
(Fig. 1E) shows that the ajax satellite array is 
hypermethylated across all elements, with a 
discernible drop in methylation across TELO-A 
subunits and peak of methylation in the MER1A 
unit in elements containing TELO-C. Subtelo- 
meric and interstitial TELO_Comp elements 
share similar methylation profiles, with higher 
methylation levels across the entire element, 
whereas pericentromeric TELO_Comp units 
have lower overall methylation levels. This 
indicates that local epigenetic states affect 
overall methylation levels but do not change 
relative levels within the ajax array and TELO 
subunits. Comparison of aggregated methyla- 
tion frequency across TELO_Comp units at the 
same loci in the human diploid assembly for 
HG002 (fig. S16) (22) shows that overall methylation 
levels are higher across TELO_Comp elements, 
including those found in centromeres, 
as expected from global differences in meth- 
ylation level between T2T-CHM13 and HG002. 
However, the overall methylation pattern for 
the TELO_Comp elements (Fig. 1E and fig. 
S16) is retained, indicating it is an epigenetic 
signature of this repeat in humans. 


Transcriptional, epigenetic, and structural 
differences define TEs across the 
human genome 


Precision nuclear run-on sequencing (PRO-seq) 
(22) detects nascent transcription from RNA 
polymerases with nucleotide resolution at 
genome scale. The resulting read density pro- 
files quantitatively reflect the occupancy of 
active polymerases across the genome. Sites of 
accumulating RNA polymerase activity (22, 23), 
such as promoter-proximal pause sites, 3’ cleav- 
age and polyA regions, splice junctions, and 
enhancers, indicate points of transcription 
regulation (22, 24). In addition, because PRO- 
seq captures RNA synthesis before mecha- 
nisms that affect RNA stability take place, 
unprocessed and unstable RNAs can be de- 
tected with high sensitivity. Capitalizing on 
the single-base resolution of PRO-seq and CpG 
methylation profiles (21), we define profiles 
of RNA polymerase activity that distinguish 
different families of retroelements (Fig. 1A). 
We assessed PRO-seq signal, CpG methyla- 
tion density, CpG site density, and sequence 
divergence from the consensus for each ele- 
ment within each subfamily, further classified as 
full-length or truncated and grouped by relative 
age (fig. S17 and tables S12 and S13) (18). For each 
element type, density profiles were correlated 
with known features of specific repeats. 
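
One common way to build such per-family density profiles is to rescale each element to a fixed number of bins and then average bin by bin across elements. The sketch below shows that idea for a per-base signal such as PRO-seq read density; the binning scheme, the function names, and the toy values are assumptions for illustration and are not the procedure documented in (18).

```python
def scaled_profile(signal, n_bins):
    """Average a per-base signal into n_bins bins spanning the element."""
    n = len(signal)
    profile = []
    for b in range(n_bins):
        lo = b * n // n_bins
        hi = max((b + 1) * n // n_bins, lo + 1)  # guard against empty bins
        chunk = signal[lo:hi]
        profile.append(sum(chunk) / len(chunk))
    return profile


def meta_profile(elements, n_bins=10):
    """Mean scaled profile over all elements at least n_bins bases long."""
    profiles = [scaled_profile(s, n_bins) for s in elements if len(s) >= n_bins]
    return [sum(col) / len(profiles) for col in zip(*profiles)]


# Two hypothetical elements of different length with signal in the 3' half.
elements = [
    [0, 0, 4, 4],
    [2, 2, 2, 2, 6, 6, 6, 6],
]
profile = meta_profile(elements, n_bins=2)
```

Because every element is reduced to the same number of bins, full-length and truncated copies of different lengths can be compared on a common axis.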
Across all full-length retroelements in T2T- 
CHM13, PRO-seq density profiles show signals 
of RNA polymerase accumulation (Fig. 2, A 
to E, and fig. S18). AluY elements show two 
signal peaks; the first corresponds to the 
known RNA pol III promoter site within the 
first monomer, while the second, broader peak 
within the second monomer indicates the site 
of a second, ancient 7SL RNA promoter (25), 
whose presence might promote polymerase 
pausing (Fig. 2A). The peak distribution closely 
mimics the relative size of the left and right Alu 
monomers and thus reflects the dimerization 
of Alu. Although active transcription continues 
in truncated AluY elements, there is no longer a 
visible signal of promoter exclusivity, and RNA 
polymerase signal spreads across the element. 
Full-length AluY elements retain a similar 
methylation profile and show low divergence 
levels corresponding with low, single-copy 
k-mer density. Truncated and older elements 
(AluJ and AluS) (table S13 and figs. S18 to S22) 
show broad methylation profiles with low 
CpG content and higher divergence (Fig. 2A). 
Transcriptionally active retroelement fami- 
lies wherein the majority of full-length ele- 
ments show high PRO-seq signal (AluY, SVA, 
and L1Hs; Fig. 2, purple lines in parallel plots) 
do have some full-length members that ex- 
hibit the full diversity of transcriptional ac- 
tivity, likely influenced by local chromatin or 
epigenetic features of surrounding insertion 
sites. 

Whereas PRO-seq signal is detected in trun- 
cated HERV-Ks (human endogenous retrovirus 
type K) that retain LTRs [LT (less than 7500 bp 
in length)/LTR+] (18), signal is reduced and 
completely lost in truncated elements without 
LTRs, as expected (26). Full-length HERV-Ks 
[GT (greater than 7500 bp in length)/LTR+] 
(18) generally have low methylation levels despite 
higher CpG content than the LT HERV-Ks, 
albeit with nonsignificant P values (Fig. 2B 
and figs. S18 to S20). Given the low number 
of HERV-K elements and high identity among 
5’ and 3’ LTRs of HERV-K elements (GT range 
0.21 to 23%, average 12.05%; LT range 1.98 to 
28.96%, average 11.58%), discerning a clear 5’ 
promoter signal was not possible. Further- 
more, SVA_E and SVA_F elements, the only 
SVA elements in the human genome that retain 
mobility (27, 28), both show similar PRO-seq 
peaks (Fig. 2, C and D), which distinguishes 
them from their truncated counterparts SVA_A, 
SVA_B, SVA_C, and SVA_D (figs. S18 to S22). 

We find evidence for RNA polymerase pro- 
moter proximal pausing at the 5’ end of the 
SVA element at predicted transcription start 
sites (TSSs) (29). Notably, we find PRO-seq 
peak signal at the 3’ end within the HERV-K/ 
LTR5a-derived portion of the element, over- 
lapping with the Kruppel-associated box 
(KRAB)-containing zinc finger proteins (KZFPs) 
controlled enhancer activity (TEEnhancer) 
identified in this region (Fig. 2, C and D, gray 
arrowheads) that contributes to human-specific 
early embryonic transcription (30). While some 
truncated SVA_F elements retain the 5’ pro- 
moter signal, most SVA elements retain the 
3' signal (Fig. 2, C and D, and figs. S18, S20, 
and S21) and thus may also retain the ability to 
modulate gene expression. 
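
Promoter-proximal pausing of the kind described above is often summarized as a pausing index: read density near the TSS divided by read density further downstream. The sketch below is a generic illustration under stated assumptions, not the analysis used here; the window sizes, the function name, and the toy coverage vector are all hypothetical.

```python
def pausing_index(coverage, tss, pause_window=150, body_start=300):
    """Ratio of mean read density near the TSS to mean density downstream.

    coverage     : per-base read counts along one element (sense strand)
    tss          : index of the predicted transcription start site
    pause_window : bases after the TSS counted as promoter-proximal (assumed)
    body_start   : offset from the TSS where the "body" region begins (assumed)
    """
    promoter = coverage[tss:tss + pause_window]
    body = coverage[tss + body_start:]
    if not promoter or not body:
        raise ValueError("element too short for the chosen windows")
    promoter_density = sum(promoter) / len(promoter)
    body_density = sum(body) / len(body)
    if body_density == 0:
        # No signal downstream: the ratio is undefined, so report inf or 0.
        return float("inf") if promoter_density > 0 else 0.0
    return promoter_density / body_density


if __name__ == "__main__":
    # Toy profile: a pile-up over the first 100 bases, low signal afterwards.
    toy = [10] * 100 + [1] * 400
    print(pausing_index(toy, tss=0, pause_window=100, body_start=200))
```

A value well above 1 indicates signal concentrated at the 5’ end, which is the pattern the text describes for full-length SVA elements.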


L1Hs elements, a major contributor to human 
structural variation (31), show a strong 
promoter-proximal pause signal at the 5’ end 
(32) (Fig. 2E). This site also contains a methyl- 
ation peak followed by a hypomethylated TSS, 
delineating full-length L1Hs elements from their 
truncated counterparts (Fig. 2E and figs. S18 to 
S22). As elements become inactivated through 
5’ truncation (33, 34) and increased diver- 
gence, CpG content and transcriptional signal 
drop considerably (Fig. 2, E and F, and figs. 
S18 to S22), indicating that CpGs are likely 
targeted for methylation and subsequent de- 
amination from cytosine to thymine. 
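
The loss of CpG content through methylation followed by deamination can be illustrated with the widely used observed/expected CpG ratio. This is a minimal sketch on toy sequences; the function names and example strings are hypothetical, and the deamination model (every top-strand CG becomes TG) is a deliberate simplification.

```python
def cpg_obs_exp(seq):
    """Observed/expected CpG ratio: (#CG * length) / (#C * #G)."""
    seq = seq.upper()
    c, g = seq.count("C"), seq.count("G")
    if c == 0 or g == 0:
        return 0.0  # no CpG is possible without both C and G
    return seq.count("CG") * len(seq) / (c * g)


def deaminate_methylated_cpgs(seq):
    """Model deamination of 5-methylcytosine at every CpG (CG -> TG).

    Simplification: only the top strand is modeled; deamination on the
    opposite strand would instead appear as CG -> CA.
    """
    return seq.upper().replace("CG", "TG")


if __name__ == "__main__":
    young = "ACGTCGACGT"                     # toy CpG-rich sequence
    old = deaminate_methylated_cpgs(young)   # same sequence after CpG decay
    print(young, cpg_obs_exp(young))
    print(old, cpg_obs_exp(old))
```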

To extend our analyses and demonstrate the 
applicability of this approach in studying other 
complex repeats in the human genome, we 
focused on the TE-derived macrosatellite SST1 
[also called MER22 (35) and NBL2 (36, 37)]. 
SST1 has demonstrated meiotic instability (38), 
and its methylation status is of clinical rele- 
vance to multiple cancer types (39-42). SST1 
arrays are variable in the human population 
(38), and our annotations identify about a two- 
fold increase over the 342 loci (315,515 bases) 
(table S14) identified in GRCh38 (excluding 
the Y chromosome, which carries an additional 
587 loci) (fig. S4). Randomized Axelerated Max- 
imum Likelihood (RAxML) phylogenetic anal- 
ysis with representative loci subsampled from 
the 16 autosomes on which SST1 resides (18) 
(Fig. 3, A and B, and table S15) showed that the 
array situated on the long (q) arm of chromo- 
some 19 represents the ancestral SST1 in the 
human genome and carries a propensity for 
centromere seeding and array size expansions 
or contractions across primate lineages (35, 43). 

The number of overlapping PRO-seq reads, 
average methylation, and percent divergence 
for each SST1 element in CHM13 were com- 
pared to delineate correlations among tran- 
scriptional, epigenetic, and structural features 
of SST1 across genomic loci. PRO-seq revealed 
that the SST1 arrays on chromosome 4 and 
centromeric monomers on chromosomes 9, 
13, and 14 are highly transcribed in compar- 
ison to other SST1 loci and are grouped in a 
single phylogenetic cluster (Fig. 3, A, C, and 
D; fig. S23; and table S16), indicating that 
centromeric SST1 repeat arrays are transcrip- 
tionally inactive in CHM13. 

Statistical analyses of SST1 repeats showed 
that the highly transcribed repeats are both 
longer and less diverged from the consensus 
sequence (t test, P < 0.0001) (Fig. 3C, fig. S24, 
and table S17) despite their basal location in 
the phylogenetic tree (Fig. 3A). CpG methyl- 
ation levels are high (>50%) for SST1 within 
chromosome 4 and 19 arrays, low (<50%) for 
centromeric monomers, and variable (low and 
high) for centromeric arrays (Fig. 3, A and D; 
figs. S24 and S25; and tables S16 and S17). 
Metaplots of aggregated methylation fre- 
quency across SST1 repeat units support this 
Fig. 2. Transcriptional profiles of TEs are highly correlated with sequence 
divergence and epigenetic features. (A to F) RNA polymerase occupancy, 
methylation levels, CpGs, and divergence for (A) AluY, (B) HERV-K, (C) SVA-E, 
(D) SVA-F, (E) L1Hs, and (F) L1P elements from CHM13. Heatmaps of (left panel) 
T2T-CHM13 PRO-seq density (Bowtie2 default “best match,” purple scale) and 
average profiles showing sense and antisense strands (upper panels, standard 
error shown in gray) and (right panel) methylated CpGs (red-purple scale, 
aggregated frequency per site) for TEs grouped by their length [(A) to (E)] [full-length 
(FL) and truncated (TR)] or L1PA subfamily [(F), all truncated]. HERV-K 
groups are delineated as follows: >7500 bp elements (GT) and <7500 bp 
elements (LT) with both 5' and 3' long-terminal repeats (LTR+). (HERV-K 
elements with only one or no LTR are shown in fig. S18C). Both GT and LT/LTR+ 
HERV-K elements are scaled. All other TEs are anchored to the 3' end, with a 
specified distance from the anchor (bottom left). Standard error for composite 
(gray), TSS (transcription start site), TES (transcription end site), location of the 
VNTR (variable number tandem repeat) within SVA are indicated. A dotted line is 
included on the heatmap denoting the static -0.1 kbp from the end of the 
annotated element. Representative schematic of elements and respective 
subcomponents are shown above the composite profile, scaled to the TES; red 
blocks indicate previously known promoter regions. (Right side of each panel) 
Parallel plots for each TE are shown, highlighting each group of TEs (FL/TR, or 
L1P subfamily; HERV-K plots represent LTRs only). Vertical axes represent scaled 
values for average methylation, number of CpG sites, and divergence from 
RepeatMasker consensus sequences for each instance of the element. Coloration 
is by the number of overlapping PRO-seq reads, where purple represents the 
highest read overlap and blue the lowest, on the scale matching each plot. 


observation and indicate that while interstitial 
arrays and monomeric SST1s carry the same 
methylation frequency at their 5’ end, mono- 
meric SST1s lose most methylation across the 
body of the element (Fig. 3E and figs. S25 and 
S26). Irrespective of this methylation pattern, 
heatmaps of PRO-seq density show that all 
highly transcribed SST1s have two internal 
peaks of high RNA polymerase occupancy that 
are closely spaced and in opposite orientations 
(Fig. 3D and fig. S25B), characteristic of RNA 
pol II promoters and enhancers. 

Together, these data suggest selective pressure 
to retain the genomic integrity of older, 
less diverged SST1 arrays and monomers that 
are actively transcribed, whereas silenced re- 
peats found in centromeric arrays are more 
susceptible to sequence variation. Contrary to 
expectations that CpG methylation renders 
repeats transcriptionally silent (44, 45), we 
find that high levels of average methylation 
across interstitial, arrayed SST1s define these 
transcribed repeats on chromosome 4 (Fig. 3, 
A, D, and E) and bear a resemblance to methylation 
patterns observed over gene bodies 
(46, 47). Chromosomal instability and cancer- 
ous phenotypes associated with demethylation 
and/or transcription of SST1 repeats have been 
reported (48, 49), indicating a need to delineate 
patient-specific and locus-specific annotations 
for SST1 (39, 50). 


The transcriptional landscape of 
human centromeres 


Centromere transcription is integral to proper 
centromere function, affecting the loading 
Fig. 4. Centromere landscape is characterized by the transcription of TEs 
rather than satellites. (A) (Left) Cell sorting data showing the stages of the 
cell cycle after synchronization and release. (Right) Ribbon plots of repeat 
abundance in PRO-seq data [shown as reads per million (RPM)] assessed by 
CASK method in asynchronous and synchronized HeLa cells collected at time 
points across the cell cycle (key in inset). A zoomed image shows the reads for 
the lower range of expressed repeats, including all satellites classified in T2T- 
CHM13 (tan). (B) Ribbon plot of repeat abundance in PRO/ChRO-seq data, 
shown as RPM, assessed by CASK method across different developmental stages 
and samples. Datasets include T2T-CHM13 PRO-seq and native RNA-seq, 
PRO-seq for RPE-1 (differentiated retinal pigment epithelial cells), and ChRO-seq 
for H9 ES (embryonic stem cells), DE (differentiated endoderm cells), duodenum 
tissue, and ileum tissue. A zoomed image shows the reads for the lowest 
categories of repeats across all samples, including the satellites classified in 
T2T-CHM13. (C) Repeat enrichment across PRO-seq and RNA-seq datasets 
(all time points and tissues) ranked from least (red) to most enriched (blue) 
on the basis of k-mers normalized to genomic frequency in T2T-CHM13. (D and 
E) Recently active retroelements (green ticks in RM2 track) found embedded 
within alpha satellite HOR arrays (red) in (D) an “old” TE island derived 
from segmental duplications on chromosome 3 and (E) solo embedded TEs and 
“young” TE islands on chromosome 1. Stranded PRO-seq profiles (Bowtie2 
default “best match”) across chromosome 3 and 1 regions encompassing the 
centromere are shown (top). TEs are transcriptionally active (PRO-seq Bowtie2 
“best match” mapping (yellow), k-100 overfit mapping (gray), and single 
(blue) and dual filtered (red) k-100 mapping data are indicated for both strands) 
and located (black boxes) at transitions in CpG methylation (metaplot at 
bottom; 200 bins total) and CpG density (blue, below) within the array. Key of 
elements in cenSAT and RM2 tracks indicated at bottom. 


of newly synthesized centromere protein A 
(CENP-A) histones (51-57). Although evidence 
suggests that RNA is a critical component 
of the epigenetic cascade leading to faith- 
ful CENP-A assembly, an assessment of nas- 
cent transcription across human centromeres 
has been lacking. The availability of high- 
confidence centromere annotations for T2T- 
CHM13 (12, 58, 59) provides an opportunity 
to assess transcription and active RNA poly- 
merase activity across previously unresolved 
regions of a human genome reference: the cen- 
tromere and the pericentromere. To capitalize 
on the T2T-level assembly and the resolution 
of PRO-seq at single nucleotides, we developed 
genome-dependent and genome-independent 
approaches to define the landscape of centro- 
mere transcription (18) (figs. S27 and S28). 
We observed low levels of satellite tran- 
scription (figs. S29 to S33 and table S18), in- 
dicating that RNA polymerase occupancy at 
centromeric satellites in CHM13 is lower than 
that observed for all other repeat types. The 
low levels of satellite transcription are not ex- 
plained by differences in genomic abundance 
between satellite repeats and other repeats. 
Indeed, after normalizing the observed PRO- 
seq levels with shuffled reads, satellite tran- 
scription is the lowest among all repeat 
types (fig. S33), indicating genome-wide repression 
of centromere satellite transcription, 
including the CENP-A-containing HORs (12). 
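
The point that low satellite signal is not simply a consequence of genomic abundance can be illustrated with a simple normalization. The sketch below is not the shuffled-read procedure referred to above; it only shows, with invented numbers, how reads per million (RPM) and an abundance-corrected enrichment could be computed per repeat class. The function names and all counts are hypothetical.

```python
def reads_per_million(read_counts):
    """Scale raw read counts per repeat class to reads per million (RPM)."""
    total = sum(read_counts.values())
    return {k: v * 1_000_000 / total for k, v in read_counts.items()}


def enrichment(read_counts, genomic_bp):
    """Share of reads divided by share of annotated bases, per repeat class.

    Values below 1 mean a class contributes fewer reads than expected
    from the amount of sequence it occupies.
    """
    total_reads = sum(read_counts.values())
    total_bp = sum(genomic_bp.values())
    return {
        k: (read_counts[k] / total_reads) / (genomic_bp[k] / total_bp)
        for k in read_counts
    }


if __name__ == "__main__":
    # Invented toy numbers, chosen only to make the arithmetic easy to follow.
    reads = {"SINE": 600, "LINE": 350, "satellite": 50}
    bases = {"SINE": 400, "LINE": 400, "satellite": 200}
    print(reads_per_million(reads))
    print(enrichment(reads, bases))
```

In this toy case the satellite class remains the least enriched even after dividing by its share of the genome, which is the kind of comparison the text describes.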

Given that centromere transcription and 
CENP-A deposition are dynamic processes 
(60), we tested whether repeat transcription 
varied across the cell cycle. After synchroni- 
zation and release into mitosis, we find that 
repeat transcription across the genome drops 
in mitosis (Fig. 4A and fig. S34). SINEs, LINEs, 
and LTRs increase transcription rates at the 
1-hour time point and reach a steady state by 
1.5 hours, coincident with the transition to G1 
after CENP-A loading. Notably, satellite tran- 
scripts are detected, but at low levels across 
the cell cycle (Fig. 4A and figs. S29 to S34). We 
used available datasets to determine whether 
the low level of satellite transcription was spe- 
cific to CHM13 or its early developmental stage. 
Across cell types and developmental stages, 
retroelements show dynamic PRO-seq pro- 
files, yet satellite transcription remains low 
(Fig. 4B and figs. S35 and S36). Across all cell 
types and time points, alpha satellites within 
the CENP-A-containing HORs (12) show gen- 
erally higher PRO-seq signal than do de- 
generate HOR alpha satellite arrays (dHORs) 
and monomers or interstitial alpha satellites 
(MONS) (fig. S37). Thus, although nascent 
transcription is low, transcription from alpha 
satellites is detectable within the HOR domain 
that demarcates the active centromere (Fig. 
4C). The low level of detectable transcripts 
within the active HOR domains contrasts with 
the transcriptional level of pericentromeric 
satellite arrays where satellite transcripts pro- 
mote the recruitment of chromatin modifiers 
to maintain the heterochromatic status of these 
domains (61). 

TE annotations for T2T-CHM13 show that 
members of retroelement subfamilies known 
to contain full-length and, in some cases, 
transpositionally active members are found 
within centromeric HOR satellite arrays and 
retain their PRO-seq signal (fig. S38 and table 
S19). We find evidence for multiple types of 
TE-alpha satellite associations across T2T- 
CHM13 (table S19); all chromosomes have TE 
insertions within alpha satellites, but several 
lack TEs within HORs (e.g., all acrocentric 
chromosomes). We also find “older” TE islands 
within HORs, derived from segmental duplica- 
tions (Fig. 4D and fig. S39), recent insertions of 
TEs within HORs, and aggregates of TEs that 
appear to form emerging TE islands (Fig. 4E, 
right, and fig. S40). Single insertions of TEs 
found within HORs, dHORs, and monomeric 
regions (table S19) remain transcriptionally 
active (Fig. 4, D and E, and fig. S38) yet show 
limited evidence of transcription of adjacent 
alpha satellites (Fig. 4E and figs. S39 and S40), 
indicating that read-through transcription 
from embedded TEs may affect alpha satel- 
lites, but not in the arrays underlying the 
CDR, the region defined by CENP-A enrich- 
ment (Fig. 4E, left) (12). 

Given the higher proportion of L1Hs in- 
sertions in HORs and work showing a link 
between L1 transcription and neocentromere 
formation (51, 62), we compared embedded 
L1Hs within HORs to those found in dHORs, 
monomers, and chromosome arms to deter- 
mine whether L1Hs embeds retained their TE 
signatures or were “overwritten” by their local 
chromatin environment. We find no statistical 
evidence that L1Hs within HORs and dHORs 
deviate in length, divergence, or average meth- 
ylation from those found outside of these 
regions (figs. S41 and S42 and table S20). 
However, L1Hs within monomeric segments 
of alpha satellites are both more diverged and 
less methylated than L1Hs that are in HORs 
(P < 0.05), dHORs (P < 0.01), or not embedded 
at all (P < 0.001) and show less transcription 
than their counterparts elsewhere in the ge- 
nome, including those in the HOR and dHOR 
(figs. S38 and S42). 

Although we find no clear link between alpha 
satellite transcription and the CENP-A domain 
overlapping the CDR (12, 21), transcription 
detected from embedded TEs marks shifts in 
methylation frequencies across satellite do- 
mains, establishing putative TE boundaries. 
Whether and how TEs facilitate these shifts is 
unknown. In previous work, the activity and 
copy number of TEs have been linked to alter- 
ations in methylation levels within centro- 
meres in interspecific hybrids, resulting in 
chromosome instability (63), indicating that a 
balance of methylation is required for centro- 
mere stabilization. With the technological ad- 
vances presented in the assembly and annotation 
of the T2T-CHM13 human reference, com- 
parative studies across other species will aid 
in revealing how the structure of the satellite- 
dense centromeres of humans differs from that of 
TE-enriched centromeres in other species (64) 
and how these differences affect centromere 
function and chromosome evolution. 


Putative TE-driven genomic DNA transductions 
and their evolutionary consequences 


The complete sequence provided by T2T- 
CHM13 revealed previously unknown patterns 
of repeat expansions across the short (p) arms 
of acrocentric chromosomes. In T2T-CHM13 
(11), we discovered previously unannotated 
repeat arrays of a 64-nucleotide sequence (Fig. 
1B) present in high copy numbers on the p 
arms of acrocentric chromosomes 14, 15, 21, 
and 22 (11) and in single or low copy number 
(<5) on eight other chromosomes (Fig. 5A and 
tables S2 and S21). A solo monomer resides 
on chromosome 10, with all other occurrences 
adjacent to an AluSx3 element (thus, with Alu 
satellite, or WaluSat). The lack of identity 
among 5’ and 3’ sequences of the chromosome 
10 locus and the AluSx-WaluSat loci on all 
other chromosomes (fig. S43), coupled with 
phylogenetic analyses across primates (figs. 
S43 to S46), indicates that an ancestral dup- 
lication of the chromosome 10 locus was 
followed by a mobile element insertion to 
form the AluSx-WaluSat unit in the last shared 
ancestor with Catarrhini. 

The WaluSat sequence exists as a single mono- 
mer at eight loci, as a duplication at three loci, 
and in one case as a pentamer. However, once 
segmental duplication events placed the AluSx- 
WaluSat on the p arms of chromosomes 14, 
15, 21, and 22, WaluSat amplified into longer 
arrays, ranging from 26 copies (chromosome 15) 
to 5836 copies (chromosome 14) (Fig. 5A). We 
hypothesize that the high degree of sequence 
similarity and copy number variation among 
p arm WaluSat arrays is due to frequent non- 
allelic or ectopic recombination events on 
acrocentric chromosomes (11, 65), which may 
be exacerbated by replication challenges asso- 
ciated with the predicted periodic G-quadruplex 
structures (66) identified at junctions of WaluSat 
sequences within arrays (18) (Fig. 5, B and C). 
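
How a tandem array can carry predicted G-quadruplexes at monomer junctions even when a single monomer has none can be shown with a toy example. The sketch below uses a simple, commonly used G4 motif pattern (four runs of at least three guanines separated by loops of 1 to 7 bases); this is an assumption for illustration and is not necessarily the predictor used in this study. `TOY_MONOMER` is an invented 28-nucleotide sequence, not the 64-nucleotide WaluSat monomer, and only the G-rich strand is scanned.

```python
import re

# Simple G4 motif: G3+ N1-7 G3+ N1-7 G3+ N1-7 G3+ (assumed pattern).
G4_MOTIF = re.compile(
    r"G{3,}[ACGT]{1,7}G{3,}[ACGT]{1,7}G{3,}[ACGT]{1,7}G{3,}"
)

# Invented monomer: two G-runs at each end, separated by a 10-base spacer
# that is too long to serve as a G4 loop within a single copy.
TOY_MONOMER = "TGGGATGGG" + "ACACACACAC" + "GGGTAGGGA"


def g4_hits(seq):
    """Return (start, end) spans of non-overlapping motif matches."""
    return [(m.start(), m.end()) for m in G4_MOTIF.finditer(seq.upper())]


def junction_spanning_hits(monomer, copies):
    """Motif matches in a tandem array that cross a monomer boundary."""
    array = monomer * copies
    unit = len(monomer)
    junctions = [unit * i for i in range(1, copies)]
    return [(s, e) for s, e in g4_hits(array)
            if any(s < j < e for j in junctions)]


if __name__ == "__main__":
    print("single monomer:", g4_hits(TOY_MONOMER))
    print("three tandem copies:", junction_spanning_hits(TOY_MONOMER, 3))
```

In this toy case the motif only appears once the end of one copy is joined to the start of the next, which mirrors the junction-restricted predictions described for the WaluSat arrays.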

The low identity among the sequences adja- 
cent to the chromosome 3 AluSx-WaluSat and 
other AluSx-WaluSat loci, along with the iden- 
tification of putative target site duplications 
(TSDs), may indicate that a transduction event 
followed the Alu insertion and preceded the 
spread across the human genome via dupli- 
cations. TE-mediated transduction (i.e., a TE 
transduction event), a process by which retro- 
elements co-mobilize DNA flanking the ele- 
ment to new genomic loci (67-70), has been 
observed for L1 and SVA elements in humans 
(67-73). TE transduction events mediated by 
Alu elements are seemingly rare (74), likely 
because of efficient termination of RNA poly- 
merase III on sequences with long poly-T tract 
lengths and nearby RNA secondary structures 
(75). Given the age of the initial insertion of the 
AluSx element, it is unknown if such an event 
was mediated by an RNA polymerase III or 
cryptic upstream RNA polymerase II promoter, 
or if other rearrangements specific to chro- 
mosome 3 degraded signal of shared identity 
with other segmental duplications (Fig. 5A, 
dashed box). 

Beyond potentially seeding new repeat se- 
quences across the genome, TE transductions 
can affect the genome through exon shuffling 
(67, 71, 76) and are a possible source of somatic 
mutations (74, 75). Here, we applied a set of 
computational approaches (18) (fig. S47) to 
annotate putative TE transduction events in 
T2T-CHM13. In total, we analyzed 971,993 
L1s and 7068 SVAs (figs. S48 to S51). After 
stringent filtering for potential artifacts, such 
as segmental duplications and putative dupli- 
cations of truncated elements, we find 65 L1, 
five 3’ SVA, and 11 5’ SVA transduction events 
(tables S21 and S22 and figs. S50 and S51). 

Of these 81 annotated transduction events, 
78 are shared with GRCh38 (Fig. 5D and table 
S23), and three appear specific to T2T-CHM13. 
One T2T-CHM13 TE transduction is in a re- 
gion of no synteny with GRCh38 and is caused 
by an L1PA4, representing an older event ac- 
cording to Kimura-2 distances (fig. S49). Of 
the remaining two T2T-CHM13 TE transduc- 
tions, both events are derived from the youn- 
gest, human-specific TEs, L1Hs and SVA-F, and 
may represent polymorphic TE transductions. 
However, we find the offspring TE in both 
GRCh38 and T2T-CHM13, yet the transduced 
sequence is missing in GRCh38, owing to a 
collapse in the sequence, highlighting the uti- 
lity of a T2T-level assembly in identifying puta- 
tive TE transduction events. 
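
The relative age of an insertion is inferred above from Kimura-2 distances. As a reminder of what that measure captures, the sketch below computes the standard Kimura two-parameter distance, d = -1/2 ln[(1 - 2P - Q) sqrt(1 - 2Q)], where P and Q are the proportions of transition and transversion differences between two aligned sequences. How the distances in this study were actually computed is not stated in this section; the function name and toy sequences are hypothetical.

```python
import math

PURINES = {"A", "G"}


def k2p_distance(seq_a, seq_b):
    """Kimura two-parameter distance between two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    # Keep only columns where both sequences have an unambiguous base.
    pairs = [(a, b) for a, b in zip(seq_a.upper(), seq_b.upper())
             if a in "ACGT" and b in "ACGT"]
    if not pairs:
        raise ValueError("no comparable sites")
    transitions = transversions = 0
    for a, b in pairs:
        if a == b:
            continue
        # Purine<->purine or pyrimidine<->pyrimidine is a transition.
        if (a in PURINES) == (b in PURINES):
            transitions += 1
        else:
            transversions += 1
    p = transitions / len(pairs)
    q = transversions / len(pairs)
    if 1 - 2 * p - q <= 0 or 1 - 2 * q <= 0:
        raise ValueError("sequences too divergent for the K2P correction")
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))


if __name__ == "__main__":
    consensus = "AAAAAAAAAA"
    copy = "GAAAAAAAAA"  # one transition (A to G) in ten sites
    print(k2p_distance(consensus, copy))
```

Larger distances from the family consensus indicate more accumulated substitutions and therefore, under the usual assumptions, an older insertion.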
Fig. 5. TE activity affects genomic repeat diversity in CHM13. (A) Maximum likelihood (ML) phylogenetic 
analyses of the AluSx3-WaluSat locus across T2T-CHM13. Chromosome location is indicated (starting 
nucleotide position shown) at each branch. Bootstrap values shown at each node, distance indicated by 
length of branch. Left shows the sequential order of events, initiating with a duplication of the chromosome 
10 WaluSat locus followed by mobile element insertion (MEI) of an AluSx3. The identification of putative TSDs 
(pink, fig. S43) and a lack of identity among sequences adjacent to WaluSat on chromosome 3 and all other loci 
(fig. S43) may indicate that a transduction event preceded the spread of AluSx3-WaluSat across the human 
genome (dotted box). MEI events upstream of the AluSx3-WaluSat are concordant with phylogenetic relationships 
among loci and indicate that the derivation of AluSx3-WaluSat loci across other chromosomes were the result 
of segmental duplication events (gray shaded box). Once the AluSx3-WaluSat was duplicated to the acrocentric 
chromosomes 14, 15, 21, and 22, a massive expansion of the WaluSat sequence (blue boxes) occurred. The 
number of WaluSat monomers within each acrocentric array is indicated on the right with monomer number 
relative to maximum monomer count 5836 on chromosome 14. (B) G-quadruplex (G4) analysis of a single 64-mer 
monomer of the WaluSat sequence showed no predicted G4 structures (top), while an in silico construct of a 
tandem array of the WaluSat shows high G4 coverage at the junction between individual WaluSat monomers 
across the array. (C) G4 analysis of the p arm of chromosome 14 shows a peak in G4 predictions coincident with 
the WaluSat array. Bottom is a zoom inset of a subset of the array showing that the junctions between most 
monomers carry predicted G4 structures. (D) Transduction events predicted for CHM13 (L1, pink; SVA 5’, 
purple) and shared between T2T-CHM13 and GRCh38 (gray shades) are shown. Chromosome connections link 
progenitor and offspring locations (fig. S49). 


CHM13 serves as a reference for comparative 
repeat analyses across humans and other 
primate genomes 

Studies of the link between TE activity and 
chromatin states can extend beyond local influences, 
as exemplified by LINE and SINE 
transcriptional activity and the chromosome- 
wide silencing of the X chromosome during X 
inactivation (77-79). Two noncoding RNAs on 
the X chromosome are central to the inactivation 
of one X in females, Xist and Tsix (80). 
These two loci overlap one another in a sense 
and antisense orientation but are in distinct 
topologically associating domains (TADs); 
Tsix is the antisense repressor of Xist, whose 
up-regulation leads to X inactivation (81). The 
bipartite structure of the locus in two TADs 
facilitates partitioning of the X inactivation 
center (XIC) and supports appropriate timing 
of X inactivation through Xist transcription in 
early development (82). Moreover, an early step 
in the formation of heterochromatin across the 
inactive X is the silencing of LINEs and SINEs 
within the Xist RNA compartment (77). 

The scarcity of SNPs (27) in T2T-CHM13, 
coupled with the short reads of PRO-seq data, 
made it impossible to discern transcripts orig- 
inating from one X allele versus the other 
within CHM13. However, we were able to 
phase reads into their individual alleles, sup- 
porting the assessment of methylation differ- 
ences of TEs between the two X chromosomes 
in the XIC. PRO-seq signal was found across 
the Xist locus, whereas no signal was detected 
from the Tsix locus, indicating that X inacti- 
vation has proceeded, resulting in differential 
methylation profiles across alleles (Fig. 6A). Low 
methylation (Fig. 6A, blue block in cluster 2) 
marks the initiation of Xist transcription, 
followed by high methylation levels across the 
Xist/Tsix locus on this allele, inclusive of the 
interspersed repeats found across the locus 
(Fig. 6A and table S25). A distinct pause (indi- 
cated by a pileup of PRO-seq signal, boxed in 
Fig. 6A) after the termination signal of the 
Xist transcript unit was found that coin- 
cides with the TAD junction and delineates 
the Xist and Tsix domains. These data are in- 
consistent with a report that androgenetic 
hydatidiform moles lack X inactivation (83). 

We also compared both the XIC and 
chromosome-wide repeat content of the chro- 
mosome X from T2T-CHM13 and HG002 (XY). 
As expected, the XIC in HG002 shows high 
methylation across the locus and only a single 
allelic cluster, with no detectable transcripts 
across the Tsix/Xist domain (Fig. 6B and table 
S25). Sequence comparison of the 269,020 re- 
peats assessed between the haploid X of HG002 
and T2T-CHM13 (Fig. 6C and tables S26 and 
S27), excluding the pseudoautosomal region (see 
fig. S52 for T2T-CHM13 PAR annotations), 
currently unassembled in HG002, uncovered 
778 repeat differences, of which 70% were simple 
repeats and 21% were TEs (64 of which were 
length outliers) (18) (fig. S53). Collectively, these 
data demonstrate that the depth of repeat 
annotations based on the T2T-CHM13 assembly 
can serve as a reference for studying human 
variation inclusive of repeats that affect local 
and regional chromatin, gene expression, and 
gene copy numbers. 

While many of the previously unidentified 
repeat classifications coincide with gaps filled 
in the T2T-CHM13 assembly, these data sup- 
ported genome-wide annotation of previously 
undiscovered repeats and TEs (Fig. 1B). To de- 
termine whether these repeat classifications 
were specific to humans, we searched for or- 
thologous sequences in the human reference 
GRCh38 and available genome assemblies for 
primates representing the great apes (Pan 
troglodytes, Gorilla gorilla, Pongo abelii), 
Hominoidea (Hylobates moloch), Catarrhini 
(Macaca mulatta, Rhinopithecus roxellana), 
Platyrrhini (Callithrix jacchus), and Strepsirrhini 
(Microcebus murinus) (18). 

(12)] and the repeat annotation for an AluJb (121) fragment, which could not reliably be delineated in copy number 
from other closely related full-length AluJb elements. 

When comparing copy numbers of repeat 
annotations between T2T-CHM13 and long- 
read, high-quality assemblies available for other 
great apes (chimpanzee, gorilla, and orangutan) 
(84), we still find an increase in copy number 
across most of the repeats identified herein 
(Fig. 6D, fig. S2, and table S28). Many repeats 
appear only as monomers in other primate ge- 
nomes or are absent in Strepsirrhini, Platyrrhini, 
Catarrhini, and lesser apes; these reduced 
counts are largely influenced by the quality 
of these assemblies and potentially high rates 
of divergence among repeats, and they highlight 
the need for telomere-to-telomere assem- 
bly approaches for comparative analyses (85). 
Finally, eight of the repeats identified herein 
are human-specific, with an additional 11 found 
only as monomers in other species (Fig. 6D and 
table S28). 


Conclusions 


The assembly of the complete, telomere-to- 
telomere human genome reference facilitated 
development of an atlas of repeats that make 
up >53% of the human genome. Through 
this collaborative effort, we have developed a 
resource of human repeat annotations and 
methods to guide future efforts in exploring the 
complexities of repeat biology in human and 
other primate genomes. We focused on repeat 
sequence, CpG methylation, and transcrip- 
tional annotation; updated repeat models 
and implemented repeat modeling tools that 
supported the identification of previously un- 
known satellite arrays; expanded the catalog 
of variants for known repeats and TEs; and 
developed annotations for complex, compo- 
site repeat elements. Deeper exploration 
of such repeats revealed the complexity of 
genetic mechanisms that affect repeats 
during different phases of their life cycle 
and thus illustrated the myriad mechanisms 
by which they are major contributors to 
defining the structure and content of the 
human genome. 

For example, we found that a TE insertion 
event captured a short sequence, WaluSat, in a 
primate ancestor. Subsequent segmental dupli- 
cations of the region carrying this composite 
TE-sat repeat spread the sequence across sev- 
eral human chromosomes, including four of the 
acrocentric chromosomes. The satellite portion 
of the repeat expanded to almost 0.5 Mbp of 
sequence on the acrocentric chromosomes, 
resulting in the alteration of the structure of 
this portion of the chromosome into regions 
dense with G4s, which are potentially func- 
tional elements (86). This example highlights 
the need for future functional studies dissect- 
ing the impact of repeats on the local chromo- 
some environment, such as replication timing, 
local transcription, DNA damage and repair 
processes, and establishing TAD boundaries. 
Moreover, this example lays the groundwork 
for exploring the impact of local environments 
(such as gene-poor regions as found on the 
acrocentric arms of human chromosomes) on 
sequence constraint and mutation rates for 
emergent repeats. 

We provide a high-confidence functional 
annotation of repeats across the human ge- 
nome. For example, we find that the tandemly 
arrayed TE-derived satellite SST1 carries dis- 
tinctive methylation and transcriptional pro- 
files, including an enhancer embedded in each 
unit, found only in specific arrays on chromo- 
somes 19 and 4. These arrays are hypervariable 
in the human population, and alterations in 
their activity have been linked to cancer (36, 48). 
However, a full understanding of copy num- 
ber variation, epigenetic instability, and tran- 
scription of SST1 elements has been hampered 
by a lack of complete annotations of copies 
of these elements elsewhere in the genome. 
Our functional annotation revealed transcrip- 
tional signatures of both promoters and en- 
hancers within active SST1 elements that may 
affect local transcription and chromatin struc- 
tures. Moreover, this enhancer implicates SST1 
in defining cellular partitions, such as paraspeckles 
and phase-separated condensates 
(36, 87), that could have an impact on other 
genomic loci. 

Combined with defining the linear order and 
content of centromeric sequences (12), we find 
that engaged RNA polymerase signal is low 
across centromeric satellites arranged in arrays, 
irrespective of stages of the cell cycle or de- 
velopment. Rather, active transcription is de- 
tected in embedded retroelements coinciding 
with shifts in methylation states that demar- 
cate active centromere domains. To date, the 
centromere biology field has been limited by a 
lack of a linear assembly across human centro- 
meres, challenging the development of models 
to describe genetic and epigenetic elements 
that define centromeric chromatin. Our data, 
in concert with centromere annotations (12), 
reveal that these high-density repeat regions 
are not static in sequence, epigenetic, or tran- 
scriptional activity and that there is a high 
degree of substructure across the centromeric 
regions that affect function. Comparing the 
landscape of the variable centromere forms 
across domains of life, and in human disease, 
will reveal the complex life cycle of centro- 
meres (64). 

Studies of human genetic variation have 
been relatively blind to repeat variation among 
individuals, particularly arrayed and complex 
repeats, as these types of sequences are recal- 
citrant to short-read sequencing technologies, 
mapping, and functional annotation methodologies.
As a preview of the utility of complete
reference genomes in studying human
genetic variation, we compared two T2T X 
chromosomes. We find 218 kbp of repeat dif- 
ferences between these two chromosomes 
(0.18% of the chromosome, excluding the 
1.9-kbp PAR), including repeat variation in 
complex arrays that carry exonic material 
and thus affect gene dosage. Thus, com- 
parative analyses of T2T-level assemblies 
reveal the potential for discovering an even 
wider range of repeat variation across the 
46 chromosomes that constitute the human 
genome. 

Finally, our work demonstrates the need to 
increase efforts toward achieving T2T-level 
assemblies for nonhuman primates to fully 
understand the complexity and impact of 
repeat-derived genomic innovations that 
define primate lineages, including humans. 
Although we find repeat variants that appear 
enriched or specific to the human lineage, in 
the absence of T2T-level assemblies from other 
primate species, we cannot truly attribute 
these elements to specific human phenotypes. 
Thus, the extent of variation described herein 
highlights the need to expand the effort to 
create human and nonhuman primate pan- 
genome references to support exploration of 
repeats that define the true extent of human 
variation. 


Materials and methods summary 

Repeat model discovery 

RepeatMasker4.1.2-p1 (88) with the Dfam3.3 
repeat library and RepeatModeler2.0.1 (8) were 
used to define repeats across the genome, fur- 
ther refined using extensive manual curation, 
as described in (18). This database was used to 
generate a final mask of the T2T-CHM13v1.1 
assembly. ULTRA (9) was used to improve the 
accuracy of tandemly repetitive satellite anno- 
tations. Gaps of >5 kb in T2T-CHM13v1.1 repeat 
annotations were identified with BEDtools (89) 
and manually curated. Monomer structure was 
confirmed using self-alignment plots. Repeat 
models were further refined to remove any 
false positives (e.g., fragments of other TEs, 
pieces of simple repeats), as described in (18).


Composite elements 


We defined a composite element as a repeat- 
ing unit consisting of three or more repeated 
sequences (TEs, simple repeats, subunits, and/ 
or satellites) found as a tandem array in at 
least one genomic location. A composite sub- 
unit is a previously unknown repeat annotation 
that is found within a composite. Whereas the 
locations of some composite elements within a 
family are present as a single copy and thus are 
likely segmental duplications derived by non- 
allelic homologous recombination (90), a com- 
posite family is distinguished by the presence 
of composite elements in an array in at least 
one location. 


LiftOver/reverse liftOver analyses 


LiftOver chains were generated from LASTZ 
alignments between GRCh38 and T2T-CHM13v1.1 
and X chromosomes of T2T-CHM13 and HG002 
with considerations as per (18). Reverse liftOver
was performed from repeat annotations in 
both assemblies; BEDtools (89) was used to 
intersect the T2T-CHM13 coordinates with 
regions lacking synteny to GRCh38. Results 
were parsed into one of five categories: full 
match (i.e., SINE/Alu/AluSx), class match (i.e.,
SINE/Alu), family match (i.e., SINE), no match,
and those set aside and subject to extensive 
manual curation to identify correct matches. 
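
The match categories above can be sketched as a comparison of slash-delimited repeat labels. This is a minimal illustration, not the authors' pipeline: the function name `match_category` is hypothetical, labels are assumed to have three levels (as in SINE/Alu/AluSx), and the fifth, manually curated category is not modeled.

```python
def match_category(source_label: str, lifted_label: str) -> str:
    """Compare two repeat labels such as 'SINE/Alu/AluSx' level by level.

    Assumption: both labels carry three slash-delimited levels, and the
    category names follow the wording used in the text.
    """
    levels_a = source_label.split("/")
    levels_b = lifted_label.split("/")
    shared = 0
    for a, b in zip(levels_a, levels_b):
        if a != b:
            break  # stop at the first level that disagrees
        shared += 1
    names = {0: "no match", 1: "family match", 2: "class match"}
    return names.get(shared, "full match")


print(match_category("SINE/Alu/AluSx", "SINE/Alu/AluY"))
```

Counting shared leading levels keeps the rule explicit: agreement at every level is a full match, and each level lost moves the pair one category down.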


Methylation metaplots 


Nanopore CpG methylation data for T2T-CHM13
and HG002 was processed as in (27).
CpG methylation frequency was calculated 
by fraction of methylated reads to total cover- 
age within bins in T2T-CHM13 or HG002 with 
the BSgenome Bioconductor package (21, 91).
Multiples of three bins were further smoothed 
with the “rollmean” function from the R package
zoo. Methylation clustering was performed
by selecting all reads spanning a locus and 
using the mclust (v5.4.7) R package with the 
“VII” model to cluster methylation calls across 
the locus (92). CpG density heatmaps were 
calculated by counting the total number of 
CpG sites per position relative to the repeat
start and end and dividing by the total num- 
ber of repeats in each group. Methylation 
single-read plots were generated in the ggplot2 
R package using geom_rect() to plot individ- 
ual reads with methylated CpGs as red and un- 
methylated CpGs as blue. 
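
The per-bin methylation frequency and three-bin smoothing described above can be sketched as follows. This is an illustrative Python analogue rather than the R code used in the study; the function names and the input lists are hypothetical, and every bin is assumed to have nonzero coverage.

```python
def methylation_frequency(methylated_reads, total_coverage):
    """Fraction of methylated reads to total coverage in each bin.

    Assumption: each bin has coverage greater than zero.
    """
    return [m / c for m, c in zip(methylated_reads, total_coverage)]


def rolling_mean(values, k=3):
    """Unpadded rolling mean over k consecutive bins.

    Returns len(values) - k + 1 values, so the ends are not padded.
    """
    return [sum(values[i:i + k]) / k for i in range(len(values) - k + 1)]


freq = methylation_frequency([2, 4, 6, 8], [10, 10, 10, 10])
smoothed = rolling_mean(freq, k=3)
print(freq, smoothed)
```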


Identification and classification of full-length 
and truncated TEs 


Full-length elements of recently active TE families
[AluY, L1Hs, HERV-K, and SVA_E/F (93)]
were retrieved from the RepeatMasker output 
and cross-referenced with PRO-seq data and 
CpG methylation data as per (18). All retroelement
classes were grouped into relative age
categories based on divergence and phyloge- 
netic distribution (6, 88, 94-99). LINEs, SINEs, 
and retroposons were grouped by subfamily; 
LTRs were grouped by family. 


PRO-seq 


For each of two PRO-seq replicates, cells were 
processed as per (18, 22). PRO-seq libraries
were prepared as previously described (22) 
with minor modifications (100). Permeabilized
cells were mixed with permeabilized
Drosophila S2 nuclei in all 4-biotin-NTP run- 
on reactions. After amplification, libraries were 
polyacrylamide gel electrophoresis (PAGE)- 
purified to remove adapter-dimers and select 
molecules between 140 and 650 bp. Libraries 
were sequenced on an Illumina NextSeq 550 
(single-end, 75 bp). Raw fastq files were trimmed 
for quality, length, and adapters using cutadapt 
(101) and reverse complemented using the 
fastx-toolkit (102). Bowtie2 (103) alignment to
Dm6 was used to remove Drosophila spike-in 
reads; remaining reads were aligned to T2T- 
CHM13 using default (“best match”) parame- 
ters (and k-100 for comparison); multimapping 
alignment files were subjected to single-copy 
k-mer filtering and processed into beds with 
BEDtools (89) for subsequent normalization 
with nonmitochondrial alignments to obtain 
counts in reads per million mapped (RPMM) 
as described in (18). Complementary analyses
were performed on read data (unmapped) as
outlined below and in (18).
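
The normalization to reads per million mapped (RPMM) mentioned above amounts to simple scaling. The sketch below is a minimal illustration with a hypothetical function name; it assumes the denominator is the number of nonmitochondrial mapped reads, as the text indicates.

```python
def rpmm(read_count: int, nonmito_mapped_reads: int) -> float:
    """Scale a raw read count to reads per million mapped (RPMM).

    Assumption: nonmito_mapped_reads is the total number of mapped,
    nonmitochondrial alignments in the library.
    """
    if nonmito_mapped_reads <= 0:
        raise ValueError("mapped read total must be positive")
    return read_count * 1_000_000 / nonmito_mapped_reads


# Example: 50 reads over a repeat in a library of 20 million mapped reads.
print(rpmm(50, 20_000_000))
```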


Statistical analyses and data visualization 


BEDtools (89) map was used to calculate av- 
erage methylation and CpG density across all 
repeats in RepeatMaskerV2 (RMv2) and in- 
corporated into 3D graphs and parallel plots. 
Genomic data were visualized using RIdeogram 
(v0.2.2) (104), Circos (v0.69-6) (105), and Circa 
(v1.2.2). Genome browser tracks and centromeric
satellite (cenSAT) annotations for T2T-CHM13
are as described in (11, 12, 21, 65).
Heatmaps for PRO-seq profiles were gener- 
ated using deepTools2 (106). Normalized data 
were binned in 10-bp windows, and repeat 
elements were anchored to the 3’ end, with 
the exception of HERV-K, which was divided
into subcategories on the basis of length and 
presence of dual LTRs and scaled as per (18).
The maximum value per bin and composite 
profiles were summarized by averaging each 
bin across all regions in the group; standard 
error was estimated and is shown in gray in 
each composite. Methylation heatmaps for 
HERV-K were generated in R ggplot2 by nor- 
malizing repeat size by start and end position 
and using geom_tile() to plot CpG methylation 
frequency at each position. For all other elements,
methylation heatmaps were anchored
at the 3’ end, using geom_tile() to plot CpG
methylation frequency at each position. 
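
The composite profiles described above (a per-bin average across regions, with a standard error band) can be sketched as below. This is an illustration under stated assumptions, not the deepTools2 implementation: the function name is hypothetical, each region is assumed to be already binned to the same number of bins, and the standard error is taken as the sample standard deviation divided by the square root of the number of regions.

```python
from math import sqrt
from statistics import mean, stdev


def composite_profile(regions):
    """Return (mean, standard error) for each bin across all regions.

    regions: list of equal-length lists of binned, normalized signal.
    """
    profile = []
    for bin_values in zip(*regions):  # iterate bin by bin
        n = len(bin_values)
        se = stdev(bin_values) / sqrt(n) if n > 1 else 0.0
        profile.append((mean(bin_values), se))
    return profile


print(composite_profile([[1.0, 2.0], [3.0, 2.0]]))
```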


SST1/L1Hs embed analyses


SST1 sequences were extracted from CHM13 
annotations via BEDtools (89) and aligned 
with MAFFT (107). The evolutionary history 
was inferred by using RAxML (108) and the
GTR+G model (109) as matched by jModelTest
(110); 100 bootstrap replicates reported. PRO-seq
density for SST1 with <15 and ≥15 reads
overlapping were determined by plotting the 
distribution of read overlaps across all anno- 
tated SST1 elements. BEDtools (v2.29.0) (89) 
was used to intersect SST1/L1Hs repeats with 
genomic locations, methylation (27), and transcriptional
data. An unpaired t test was performed
to quantify differences among repeats
in each group by repeat length, percent diver- 
gence, percent insertions, percent deletions, 
and average methylation. Violin plots were 
generated via GraphPad Prism (v9.1.1). 


HeLa cell cycle analyses 


Given the low rate of cell division and synchro- 
nization challenges in CHM13 cells, HeLa-S3 
cells were used, noting the caveat that this cell 
line carries high levels of karyotypic instability 
(111). HeLa-S3 cells were arrested as per (112), 
mitotic cells collected and subsequently grown 
for the corresponding time or immediately 
permeabilized (mitotic sample) as described 
in (18). All time points were collected in replicate
experiments. Before cellular permeabilization,
10% of each sample was removed,
fixed in cold 75% ethanol, and stained with 
propidium iodide, and DNA content was ana- 
lyzed using a BD FACSAria II. The flowCore 
package was used to read FCS files into R. PRO- 
seq libraries (both replicates) were prepared 
as previously described (22), with minor modifications
as for CHM13 (18). All data were processed,
mapped, and normalized as above for
CHM13. Comparative and quantitative analyses
are outlined below and described in (18).


H9 ChRO-seq data analyses 


External chromatin run-on and sequencing 
(ChRO-seq) data (GSE142316) for four develop- 
mental stages in replicate (ES, DE, duodenum, 
and ileum) (113) of H9 cells were used for comparison
to CHM13. H9 ChRO-seq data were preprocessed
using the proseq2.0 pipeline to
generate adapter-trimmed and deduplicated 
fastq files used for repeat composition analysis 
as per (18). 


Preprocessing, mapping, and postprocessing of 
RNA-seq data (CHM13 and HG002)


Data from two replicates of CHM13 paired- 
end native RNA sequencing (RNA-seq) using 
oligoDT (12) were processed with the same 
workflow as the CHM13 PRO-seq data, with 
minor modifications as per (18). External
paired-end ribodepleted RNA-seq data for 
HG002 (GM24385) were used for compari- 
son, preprocessed as per CHM13 RNA-seq and 
mapped to a combined assembly of T2T-CHM13 
autosomes, HG002 chrX, and GRCh38 chrY
with Bowtie2. 


Comparative analyses of transcript 
quantification approaches 


To complement TE (herein) and centromere 
satellite repeat annotations (12), we imple- 
mented a three-pronged approach to define 
centromere transcription as described in (18):
a mapping-dependent approach, in which PRO- 
seq (two replicates) and RNA-seq (two rep- 
licates) data were mapped and reads were 
intersected with single copy k-mers derived from 
the T2T-CHM13 assembly and whole-genome 
shotgun polymerase chain reaction-free reads 
(11, 114); a mapping-independent approach 
in which unmapped PRO-seq and RNA-seq 
reads were annotated using classification of 
ambivalent sequences using k-mers (CASK) 
and a T2T-CHM13-dependent k-mer database 
formed via T2T-CHM13 repeat annotations; 
and a genome-independent approach, in which 
PRO-seq and RNA-seq reads were processed 
through RepeatMasker using the human Dfam 
3.3 library. RepeatMaskerV2 (RMv2) was intersected
with cenSAT annotations to identify and
label repeats adjacent to alpha satellites des- 
ignated HOR, dHOR, MON, or “none of the 
above” regions (RMv2-alpha). 

To compare across these three methods, 
BEDtools (89) coverage was used to obtain 
counts of reads overlapping repeats defined 
in RMv2 and RMv2-alpha across all mapping
methods, requiring at least 50% (~25 to 30 bp, 
roughly equivalent to the CASK k-mer length) 
of the read to overlap the repeat element [and 
see (18)]. The relative abundance of each repeat
was similar across replicates; thus, counts
from both replicates were summed. Variable 
bowtie mapping parameters (default, k-100, 
and k-100 filtered for single copy k-mers with 
multiple filters) on PRO-seq and RNA-seq 
datasets were assessed (18).
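
The overlap requirement above (at least 50% of the read must overlap the repeat element) can be illustrated with a short sketch. The function name and the half-open, BED-style coordinate convention are assumptions made for this example, not details taken from the methods.

```python
def read_overlaps_repeat(read, repeat, min_fraction=0.5):
    """Check whether enough of a read overlaps a repeat interval.

    read, repeat: (start, end) tuples in half-open, BED-style coordinates.
    The fraction is measured against the read length, not the repeat length.
    """
    read_start, read_end = read
    rep_start, rep_end = repeat
    overlap = max(0, min(read_end, rep_end) - max(read_start, rep_start))
    return overlap >= min_fraction * (read_end - read_start)


# A 60-bp read with 30 bp inside the repeat meets the 50% threshold.
print(read_overlaps_repeat((100, 160), (130, 500)))
```

Measuring the fraction against the read keeps short reads at the edge of a long repeat from being counted when most of their sequence lies outside it.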


WaluSat analyses 


The evolutionary history of WaluSat, AluSx, 
and the AluSx-WaluSat loci were inferred by 
using the maximum likelihood method as
described (18). Dotplots were generated by 
comparison of 1.5-kb sequences flanking both 
5' and 3’ regions adjacent to WaluSat insertions
with FlexiDot as per (18). G-quadruplex
analysis was performed with G4Hunter (115).


Transduction analyses 


TE transduction events were analyzed using 
the modified TSDfinder tool (67), filtering for 
artifacts such as segmental duplications and 
truncated elements, and refined on the basis 
of TE age using Kimura-2 distance parameters 
as described in (18).


ChrX liftOver analysis and repeat 
fasta comparison 


Lifted T2T-CHM13 chrX to HG002 coordinates
were compared (18) using a similarity
score as a percentage of the max score (>90% 
were considered concordant, <50 bp were in- 
sufficient; others were considered potentially 
polymorphic). Sequences of interest were 
filtered for length differences between the 
liftOver coordinates. Differences were sub- 
ject to manual curation depending on repeat 
type, and the final loci were subjected to 
RepeatMasker analysis. 
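
The thresholds quoted above can be sketched as a small classifier. This is one possible reading of the description rather than the authors' code: the function name is hypothetical, and checking the 50-bp length cutoff before the similarity score is an assumption.

```python
def classify_lifted_repeat(score: float, max_score: float, length_bp: int) -> str:
    """Classify a lifted repeat by length and similarity score.

    Assumptions: sequences shorter than 50 bp are set aside first, and
    similarity is the score expressed as a percentage of the maximum score.
    """
    if length_bp < 50:
        return "insufficient"
    percent_of_max = 100.0 * score / max_score
    if percent_of_max > 90.0:
        return "concordant"
    return "potentially polymorphic"


print(classify_lifted_repeat(score=95, max_score=100, length_bp=300))
```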


Copy number comparison across primates 


Copy number comparisons across primate 
genomes (18) were generated with the most
recent, available primate genomes for each 
species: Pan troglodytes (accession: GCA_ 
002880755.3) (84), Gorilla gorilla (accession: 
GCA_900006655.3) (116), Pongo abelii (accession: 
GCA_002880775.3) (84), Hylobates moloch 
(accession: GCA_009828535.2), Macaca 
mulatta (accession: GCA_008058575.1) (117), 
Rhinopithecus roxellana (accession: GCF_ 
007565055.1) (118), Callithrix jacchus (accession: 
GCF_009663435) (119), and Microcebus murinus 
(accession: GCF_000165445.2) (120). BLAST 
was used to search each genome for individual 
instances of the corresponding repeat or com- 
posite element, requiring at least an 85% length 
match to the query repeat/composite monomer 
and a 100% match requirement across the 85% 
length for gap tandem arrays. 
INTRODUCTION: The human reference genome
has served as the foundation for many large- 
scale initiatives, including the collective effort 
to catalog the epigenome, the set of marks and 
protein interactions that act to control gene 
activity and cellular function. However, for 
more than two decades, efforts to construct a 
complete epigenome have been hampered by 
an incomplete reference genome. With recent 
technological advances, we can now study ge- 
nome structure and function comprehensively 
across a complete telomere-to-telomere human 
genome assembly, T2T-CHM13. As a result, we 
can now broaden the human epigenome to 
include 225 million base pairs (Mbp) of addi- 
tional sequence. 


RATIONALE: The epigenome refers to DNA 
modifications (e.g., CpG methylation), protein- 
DNA interactions, histone modifications, and 
chromatin organization that collectively influ- 
ence gene expression, genome regulation, and 
genome stability. These epigenetic features 
are heritable upon cell division but dynamic
during development, generating profiles that 
are unique to different tissues and cell types. 
Here, we present an epigenetic annotation of 
the human genome in which we explore pre- 
viously unresolved regions, including acro- 
centric chromosome short arms, segmentally 
duplicated genes, and a diverse collection of 
repeat classes, including human centromeres. 
Generating a complete epigenetic annotation 
of the previously missing 8% of the human 
genome provides a foundation for elucidating 
the functional roles of these genomic elements 
that are critical to our understanding of ge- 
nome regulation, function, and evolution. 


RESULTS: Completion of the human epigenome 
required that we develop approaches to profil- 
ing the previously unresolved regions. Using 
the T2T-CHM13 reference with existing short- 
read epigenetic data, we identified 3 to 19% 
more enrichment sites for epigenetic markers. 
However, even with the complete reference, 
these short-read epigenetic methods cannot 
correctly resolve regions of the genome of 
Epigenetic characterization across a complete human genome. (A) The T2T-CHM13 reference
contains filled gaps and corrected sequences. Using short- and long-read sequencing data, we functionally
annotated these added regions. (B) Tandem repeats, which are nearly identical, vary in epigenetic
state depending on genomic location. (C) The epigenetic basis of centromere identity is variable among
diverse individuals. (D) In genes associated with disease, short reads mapped to T2T-CHM13 elucidate
epigenetic dysregulation in human disease states.
high similarity, including segmental duplications,
gene paralogs, or large repeat arrays.
On the other hand, long-read epigenetic
methods can resolve single-molecule epigenetic 
patterns within these regions by anchoring to 
flanking or infrequent unique regions, provid- 
ing a foundational assessment of these areas. 
Long-read methylation calls using the T2T- 
CHM13 assembly increased the number of 
probeable CpG sites by 10% (3.2 M), revealing 
epigenetic patterning of genomic regions that 
were previously intractable. We generated 
long-read methylomes of distinct develop- 
mental time points and surveyed >99% of the 
genome’s CpGs. We probed highly homologous 
gene families and observed paralog-specific 
differences in regulation between disease and 
nondisease states. In tandem repeats, we iden- 
tified differences in epigenetic regulation be- 
tween genetically identical sequences present 
across different genomic locations, observing 
locus- and single-molecule-level differences 
in methylation. Our analysis revealed that 
these regions vary in epigenetic and tran- 
scriptional activity despite high sequence 
identity, highlighting the importance of the 
local chromosome environment as a modu- 
lator of epigenetics. Finally, the T2T-CHM13 
genome assembly has opened exploration of 
the human centromere, enabling us to probe 
the epigenetic elements that define cen- 
tromeric chromatin. The centromere is the 
site of assembly of the kinetochore complex, 
an essential complex for eukaryotic cell divi- 
sion. We generated complete epigenetic maps 
of human centromeres, revealing epigenetic 
markers of centromere activity that denote 
active human kinetochores. We predicted 
kinetochore site localization within active 
centromeres and report variability of kineto- 
chore localization across individuals repre- 
senting diverse ancestry. 


CONCLUSION: The improvements in epigenetic 
profiling using T2T-CHM13 set the foundation 
for complete assemblies and long-read epi- 
genetics for major biological advancements. 
Using technological advances in genome re- 
sequencing and alignment, we present a com- 
prehensive functional assessment of previously 
unresolved genomic regions. This study marks 
the start of exploration into duplicated and 
repetitive portions of the epigenome, pio- 
neering the exploration of epigenetics in a 
complete human genome. 
The completion of a telomere-to-telomere human reference genome, T2T-CHM13, has resolved
complex regions of the genome, including repetitive and homologous regions. Here, we present a 
high-resolution epigenetic study of previously unresolved sequences, representing entire acrocentric 
chromosome short arms, gene family expansions, and a diverse collection of repeat classes. This 
resource precisely maps CpG methylation (32.28 million CpGs), DNA accessibility, and short-read 
datasets (166,058 previously unresolved chromatin immunoprecipitation sequencing peaks) to provide 
evidence of activity across previously unidentified or corrected genes and reveals clinically relevant 
paralog-specific regulation. Probing CpG methylation across human centromeres from six diverse 
individuals generated an estimate of variability in kinetochore localization. This analysis provides a 
framework with which to investigate the most elusive regions of the human genome, granting insights
into epigenetic regulation.


The human reference genome has served
as the foundation for many large-scale
epigenetic initiatives (1-3) that aimed to
catalog regulatory elements involved in 
gene activity and cellular function. How- 
ever, efforts to construct a complete annotation 
of functional elements have been hampered 
by an incomplete reference genome. With 
recent technological advances, we are now 
able to study genome structure and function 
comprehensively across the finished, telomere- 
to-telomere human genome assembly, T2T- 
CHM13, which is based on the CHM13 cell 
line derived from a complete hydatidiform 
mole (4). As a result, we can now broaden the 
human epigenome to include 225 million base 
pairs (Mbp) of sequence, representing entire 
acrocentric chromosome short arms, gene 
family expansions, and a diverse collection of 
repeat classes. 
The epigenome is influenced both by the 
specific genetic sequence and the sequence 
context, i.e., the flanking regions and placement
of the loci within the complex structure
and organization within the nucleus (5). The 
same genetic sequence can perform different 
functions or be regulated differently depending 
on the location of the sequence and its epi- 
genetic state. This is especially relevant given 
possible evolutionary advantages that may 
be conferred by gene duplication, such as 
selectively silencing or activating different 
paralogous gene copies. These processes are 
hypothesized to diversify gene activity across 
developmental time and different tissues (6). 
Beyond evolutionary questions, epigenetic dys- 
regulation of repetitive sequences can play a 
key role in development and human disease. A 
diverse set of repeat sequences that are diffi- 
cult to probe in the human reference genome 
GRCh38 have been implicated in facioscapulo- 
humeral muscular dystrophy (FSHD) (associ- 
ated with deletions in D4Z4) (7); schizophrenia 
(associated with an expanded repeat in TAF1)
(8); neuroblastoma (associated with somatic
hypomethylation of SST1) (9); lung cancer
(associated with CT47 expression) (10); pan- 
creatic ductal adenocarcinomas (associated 
with HSat2 expression) (11); and immunodeficiency,
centromeric region instability,
and facial anomalies syndrome (ICF) (asso- 
ciated with heterochromatin abnormalities 
in HSat2,3) (12). 

Within the improved T2T-CHM13 reference, 
the previously unresolved areas are highly 
repetitive, containing only infrequent sites 
of unique, mappable regions. This presents 
a limitation to short-read sequence mapping 
strategies, even with a more accurate refer- 
ence and unique k-mer anchored alignments 
(13, 14). Emerging long-read technologies (15)
offer sequence lengths capable of spanning
infrequent unique markers and provide a di- 
rect measurement of the base sequence and 
epigenetic state on single molecules (16, 17).


RESULTS 
Epigenetic profiles from a T2T genome in 
disease-relevant loci 


The T2T-CHM13 assembly resolves gaps and 
corrects misassembled or patched regions in 
GRCh38, leading to the introduction of nearly 
225 Mbp (4). Using existing short-read epige- 
netic data from the ENCODE project (1), we 
probed previously unidentified areas of the 
genome. To ensure accurate mapping to these 
regions, we intersected ENCODE chromatin 
immunoprecipitation-sequencing (ChIP-seq) 
alignments with unique k-mers of varying size 
of k (range, k = 50 to 100; fig. S1 and tables S1
and S2) (1, 18). On average, 2.35% more reads
mapped to T2T-CHM13 than GRCh38 across 
six different histone marks and CTCF, an im- 
portant regulator of chromatin architecture 
(fig. S2). Reads filtered out of GRCh38 due to 
non-unique mapping were largely confined to 
the satellite DNA and segmental duplications 
(SDs) (fig. S3). Although the total number of 
peaks called per sample was variable because 
of differences in cell type, all samples had an 
increase in the number of peaks called when 
comparing T2T-CHM13 with GRCh38 (Fig. 1A). 
As expected, we saw the most substantial in- 
crease in H3K9me3 (19.4%) and H3K27me3 
(15.2%) enrichment compared with GRCh38 
(Table 1), consistent with the introduced peri/ 
centromeric satellites (CenSat), SDs, and other 
repetitive sequences in T2T-CHM13 (Fig. 1A) 
that are associated with constitutive heterochromatin
(19). The number of called peaks in
activating marks increased as well; most no- 
tably, there was a 4.9% increase in H3K36me3, 
a mark present across active gene bodies. Pre- 
viously unresolved activating histone peaks 
(H3K27ac, H3K4me1, H3K36me3, and H3K4me3)
and CTCF were primarily enriched in unique 
genic regions and in SDs (Fig. 1A). 
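
As a worked check of the percentages quoted above, the sketch below recomputes the H3K9me3 gain from the summed peak counts listed in Table 1 (194,681 peaks in GRCh38 and 241,497 in T2T-CHM13). It assumes, as an interpretation rather than a stated definition, that the gain is expressed relative to the T2T-CHM13 count; the function name `percent_gain` is hypothetical.

```python
def percent_gain(peaks_grch38: int, peaks_chm13: int) -> float:
    """Additional peaks in T2T-CHM13 as a percentage of the T2T-CHM13 total.

    Assumption: the denominator is the T2T-CHM13 count, which reproduces
    the 19.4% reported for H3K9me3.
    """
    return 100.0 * (peaks_chm13 - peaks_grch38) / peaks_chm13


# H3K9me3 summed peak calls from Table 1.
print(round(percent_gain(194_681, 241_497), 1))
```

Under this reading, the 46,816 additional H3K9me3 peaks make up about 19.4% of all peaks called on T2T-CHM13.
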
T2T-CHM13 increased the number of anno- 
tated genes by 5.7% (4), revealing 2680 genes 
exclusive to T2T-CHM13 with no assigned ortholog
in GRCh38 (18). These gene predictions
require detailed study for functionality and 
validation. Here, we generated a functional 
annotation of the previously unresolved genes 
using activating peaks (H3K4me3 or H3K27ac) 
from ENCODE cell lines. We annotated acti- 
vating peaks from at least two ENCODE cell 
lines at the transcriptional start site at 57 of 
these previously unresolved genes (table S3). 
Of these loci, most (n = 20) were long noncoding 
RNAs (lncRNAs), including LINC01666, which
is known for its associations with gastric cancer
(20). Many (n = 19) were pseudogenes, including
FSHD region gene 1 (FRG1), which is a
poorly understood candidate gene for FSHD
(21). Three were protein-coding genes, including
BOLA2B, one of the most common
genes associated with autism (22).

Fig. 1. Epigenetics in previously unresolved genome regions. (A) Top: bar plots of the number of peaks called per ENCODE sample using dynamic k-mer mapping to GRCh38 (blue) or T2T-CHM13 (salmon). Bottom: pie charts indicating the genomic localization of peaks found only in T2T-CHM13. (B) Number of T2T-CHM13 unique ENCODE peaks across chromosomes 5, 6, 15, 16, 17, and 19 in 50-kb bins

Our analysis of previously unresolved 
ENCODE peaks revealed enrichment of peaks 
for high-copy-number gene families (e.g., GOLGA, 
NPIP, ZNF, and TBC1D3) (Fig. 1B). Large structural
variants resolved in T2T-CHM13 explain
the additional ChIP-seq mapping events (fig.
S4). Epigenetic annotation at these genetic loci
may lead to insights of paralog-specific func- 
tion in evolution (e.g., human-specific neu- 
ral genes) and disease (23, 24). For instance, 
SMN1/2 is associated with spinal muscular
atrophy (SMA) and was historically one of 
the most difficult regions to assemble (25). 
At the SMN2 gene, we observed peaks of the 
activating H3K4me3 mark at the promoter 
in all four ENCODE cell lines analyzed (fig.
S5), indicating high transcriptional activity
of the gene across tissues. SMA is a leading 
cause of childhood death (26) and has the 
potential to be treated by regulating expres- 
sion through histone deacetylase inhibitors, 
but understanding the disease-specific epi- 
genetic differences between paralogs has been 
challenging (27). 
Table 1. Peaks called using ENCODE datasets. Shown is a summary of ENCODE peak analysis including the mark profiled, summed peak calls per mark
across all datasets, difference in peak number between references, and number of datasets. 


Mark 
H3K9me3 


Peaks called in GRCh38 
194,681 


Peaks called in CHM13 


241,497 


Difference in no. of peaks 


No. of datasets 


Increase in peaks 


249,945 


327,713 


611,645 


294,819 
342,284 


632,837 


H3K4me3 


Another previously intractable region of the 
genome, the HLA locus, is critical for under- 
standing a wide range of biology from im- 
munity to neuropsychiatric disorders (28, 29). 
Our results reveal enrichment of ENCODE 
peaks across a variety of histone marks at the 
HLA locus (Fig. 1B and fig. S6A). Decreasing 
expression of HLA genes is associated with 
soft tissue cancers, particularly prostate can- 
cer, and can even be indicative of chemother- 
apy resistance (30). Comparing non-neoplastic 
adult human prostate epithelial cells (RWPE-1) 
and the c-Ki-ras-transformed prostate cancer 
model cells from the same donor (RWPE-2) 
(31), we observed a decline in H3K27ac, an 
activating mark, at HLA gene promoters, con- 
comitant with an increase in CTCF binding in 
RWPE-2 (Fig. 1C and fig. S6B). The differences 
in histone marks in this region indicate epi- 
genetic dysregulation of the HLA locus in pros- 
tate cancer that may warrant further studies 
and inform upon potential therapies (32). 


Long-read sequencing to derive complete 
human methylomes 


Methylation profiling has traditionally faced 
difficulties in mapping success rates to repet- 
itive regions of the genome, and such mapping 
inefficiencies are exaggerated by the bisulfite 
conversion of unmethylated cytosine to uracil, 
sequenced as thymine (33). Methylation pro- 
files in T2T-CHM13 using long-read nanopore 
data demonstrate an increase in the genome 
coverage (32.8 M compared with 29.17 M in 
GRCh38, omitting chromosome Y) and sur- 
veyed more CpGs (10%, 3.18 M) compared with 
short-read whole-genome bisulfite sequencing 
(WGBS) (Fig. 1D). We called nanopore meth- 
ylation data with Nanopolish (34), finding a 
high correlation (R = 0.937) both to WGBS re- 
sults in regions mappable by both data types 
(Fig. 1E) and to the alternative nanopore meth- 
ylation caller, Megalodon (R = 0.952) (fig. S7). 
Examining the difference between mapping 
of WGBS and nanopore methylation data, 
we generated short-read mappability scores 
in 200-bp windows, with a score of 0 being 
unmappable and 200 being highly mappable 
(18). We found that the 165 Mbp of sequence
with a score of 0 (highly unmappable) is en- 
riched in SDs and satellite DNA. Stratifying 
the nanopore data by read length, we found 
reads longer than 50 kilo-base pairs (kb) were 
capable of accurately determining methyla- 
tion in these regions (figs. S8 and S9). 

We sequenced the CHM13 cell line, represent- 
ing an early developmental state, and HG002, 
a terminally differentiated lymphoblast cell line.
The CHM13 and HG002
nanopore datasets surveyed 32.19 M CpGs
(99.7% of total) and 32.26 M CpGs (99.9% of
total), respectively. As expected for differentiated cell lines,
most of the HG002 genome is methylated
(75% median methylation), with a secondary 
peak of unmethylated CpGs largely reflect- 
ing unmethylated CpG islands (CGIs) (fig. 
S10). By contrast, CHM13 is markedly hypomethylated
(36.8% median methylation), as
expected from a trophoblastic cell line (35). 
Comparing CHM13’s methylation state with 
existing DNA reduced representation bisulfite- 
sequencing data on early human embryos (fig. 
S11 and table S4) (35), we observed that CHM13 
clusters closely with cleavage and blastocyst- 
stage embryos as well as trophectoderm tissue. 

To probe chromatin state in repetitive DNA, 
we generated long-read nanoNOMe data on 
HG002 using M.CviPI methyltransferase to 
decorate accessible chromatin with exogenous 
GpC methylation (76) and called CpG and GpC 
methylation with Nanopolish to measure 
chromatin accessibility (figs. S12 and S13). With the 
combination of long-read epigenetic data and 
the complete human reference, we now de- 
scribe a complete human epigenome, providing 
a foundation for further study. 


Paralog-specific epigenetic regulation 


The neuroblastoma breakpoint family (NBPF) 
of genes has been implicated in the expansion 
of the human prefrontal cortex since 
our lineage diverged from apes (36). One of its 
copies, NBPF1, has been reported to act as a 
tumor suppressor in neuroblastoma, in which 
hypomethylation of CGIs has been associated 
with astrocytoma formation (37). Understanding 
the regulation of this gene family, however, has 
been particularly challenging because the NBPF 
genes correspond to large, high-identity dupli- 
cations (>98%) that are copy number poly- 
morphic among humans and map to gaps in the 
existing reference sequences (38). The fully 
resolved nature of T2T-CHM13 allowed us to 
remap ENCODE data to discover regulatory 
elements associated with this gene family 
(Fig. 2A). When comparing the balance between 
H3K36me3, a mark of active exons and gene 
bodies, and H3K27me3, a repressive mark, in 
samples including the BE2C cell line (neu- 
roblastoma) and primary brain microvascular 
tissue (normal brain), we found that BE2C shows 
a higher proportion of H3K27me3 peaks (BE2C 38, 
brain 8) and a lower proportion of H3K36me3 
peaks (BE2C 36, brain 89) at NBPF loci (fig. 
S14 and table S5). Taking advantage of the 
increased resolution and more accurate NBPF 
copy number provided by T2T-CHM13 (39), we 
assayed paralog-specific epigenetic changes oc- 
curring in neuroblastoma (Fig. 2B). Among 
the different NBPF gene copies, the largest 
shifts in epigenetic regulation occurred at 
NBPF26 and NBPF10, moving from active 
marks in primary brain microvascular tissue 
to repressive marks in BE2C. These specific 
NBPF copies are noteworthy because they as- 
sociate with the human-specific duplicate genes 
NOTCH2NLA and NOTCH2NLR, determinants 
of the size and complexity of the human neo- 
cortex (40). This association identifies the 
functional NBPF copies, emphasizing the im- 
portance of studying paralog-specific epigenetics 
for the discovery of potential drug targets. 
Regulatory regions are excluded because of 
low short-read mappability scores among high 
identity paralogs as in the NBPF gene family 
(Fig. 2C and fig. S15) (78). We found that 
genome-wide methylation, H3K4me2 (a mark 
of active promoters), and H3K27me3 (a rep- 
ressive mark) correlate with Iso-Seq coverage 
(transcription) and together can be used to 
systematically evaluate the functional activity 
of this gene family (fig. S16). We correlated 
this activity with the evolutionary age of the 
paralogs, estimated using NBPF gene paralogs 
from six nonhuman primates from local genome 
assembly of the NBPF gene family from 
each primate (39) (fig. S17). The oldest paralog, 
NBPF17P, has low Iso-Seq coverage correlated 
with an epigenetic signature consistent with 
a repressive state, including promoter 
hypermethylation and inaccessibility, enrichment 
of H3K27me3, and decline of H3K4me2 (Fig. 
2D). By contrast, the younger paralogs, 
including human-specific copies, have higher 
Iso-Seq coverage and epigenetic signatures 
consistent with active genes, including 
hypomethylated and accessible promoters and 
enrichment of H3K4me2. Activity in the younger 
paralogs is more variable, with NBPF10 and 
NBPF20 displaying high functional activity 
and sharing promoters with NOTCH2NLA and 
NOTCH2NLB. Taken together, our results 
illustrate the role of epigenetics in the regulation 
of gene paralogs, silencing evolutionarily older 
paralogs while activating newer copies. This 
provides mechanistic insight into potentially 
functional genes related to human-specific 
cortical expansion and dysregulation in neoplasia. 

Fig. 2. Paralog-specific epigenetic regulation of the NBPF gene family. 
(A) Location of T2T-CHM13 previously uncalled ENCODE peaks across 
chromosome 1 in 50-kb bins (purple). Chromosome ideograms contain the density 
of previously unannotated genes (red) and centromere annotations (dark gray). 
NBPF paralogs are indicated by black arrows (top). (B) Heatmap illustrating the 
number of peaks for H3K36me3 (orange) and H3K27me3 (purple) per NBPF 
paralog in the ENCODE cell line BE2C (neuroblastoma) and brain tissue (primary 
brain microvascular tissue). Arrows indicate NBPF10 and NBPF26. (C) Epigenetic 
data at the NBPF10 promoter and first intron (chromosome 1: 45,300,425 to 
145,348,763). Short-read mappability score from 0 to 200 calculated as a 200-bp 
region, with a score of 200 being the most mappable and 0 being the least 
mappable. Coverage tracks (Illumina WGBS and ONT) and CUT&RUN tracks 
display read pileups. Long-read methylation tracks show base-level methylation 
frequency, with 0 as unmethylated and 1 as fully methylated. The long-read HG002 
accessibility track is a 200-bp binned Z-score of nanoNOMe GpC methylation 
frequency. Dashed boxes highlight the promoter region that is largely unmappable 
with short reads. (D) Bottom: younger NBPF12 gene paralog displaying CHM13 and 
HG002 nanopore methylation, CHM13 H3K4me2 and H3K27me3 CUT&RUN 
coverage, and HG002 nanoNOMe. Top: older NBPF17P gene paralog displaying 
CHM13 and HG002 nanopore methylation, CHM13 H3K4me2 and H3K27me3 
CUT&RUN, and HG002 nanoNOMe. Numbers in parentheses refer to the number 
of PacBio Iso-seq transcripts mapped to this paralog. 


Array-specific epigenetic regulation of 
tandem repeats 

Using k-mer-directed ENCODE alignments to 
the T2T-CHM13 reference, we report epigenetic 
features from human centromeric regions, sub- 
telomeres, and acrocentric short arms, which 
represent previously unresolved regions of the 
genome that are dominated by CenSat DNAs 
(fig. S18). Five different ENCODE lines had 
an enrichment of H3K9me3 in CenSat DNA, 
notably observed in short-read mappable 
regions of the acrocentric short arms (fig. S19). 
SJCRH30 (a rhabdomyosarcoma-derived line) 
had lower H3K9me3 enrichment in CenSat 
compared with the rest of the chromosome, 
suggesting satellite epigenetic dysregulation 
as a clinically relevant pathology in rhabdo- 
myosarcoma (figs. S19 and S20A and B). This 
trend can be observed with more detail in an 
HSat3 repeat on the acrocentric arm of chromo- 
some 15, where H3K9me3 in SJCRH30 is clear- 
ly depleted compared with HAP-1 (fig. S20C). 

In contrast to these heterochromatic marks, 
we found enrichment of activating marks, 
including H3K27ac, H3K4me3, and CTCF 
in the telomere-associated repeat (TAR) re- 
gion, typically located 2 kb upstream from 
the canonical telomeric repeat. A CTCF site 
in the TAR loci drives transcription of the 
TERRA lncRNA (4), a negative regulator of 
telomerase-mediated telomere elongation. We 
observed enrichment of CTCF in all ENCODE 
cell lines at the TAR loci (fig. S21A), but the 
subtelomeric regions were rich in SDs, resulting 
in the TAR sequence being dispersed through- 
out the genome (42). When comparing telo- 
meric TAR sequences with nontelomeric TAR 
sequences, we did not observe statistically sig- 
nificant differences (Kruskal-Wallis, P = 0.12) 
in sequence divergence (fig. S21B). Although 
both telomeric and nontelomeric TAR se- 
quences are enriched for CTCF, the nontelo- 
meric TAR sequences are more enriched for 
the activating chromatin marks H3K27ac and 
H3K4me3, suggesting differences in TERRA 
activity. 

Examining nanopore CpG methylation in 
tandemly repeated satellite DNA elements in 
CHM13 and HG002 revealed hypomethylation 
in CHM13 compared with HG002 (Fig. 3A) 
(43). To assess the chromatin profile of satellite 
repeats, we called accessibility peaks from the 
HG002 nanoNOMe data (78). We found that, 
when corrected for the size of the region, re- 
peats have lower peak density than the ge- 
nome as a whole. The number of nanoNOMe 
peaks per megabase of sequence was lower 
in satellite DNA (1.5), LINEs (8), SINEs (15), 
and LTRs (13.4) compared with the whole ge- 
nome (31.8) (Fig. 3B and table S6) (44). The 
human satellites (HSat 2,3) and monomeric 
alpha satellites (MON) were largely devoid of 
accessibility peaks. Repetitive DNA is typically 
associated with densely packed heterochro- 
matin (45), and our findings are consistent 
with this association and transcriptional pro- 
files from (44). However, our data allow us to 
investigate accessibility profiles within pre- 
viously unmappable satellite repeats. 
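
The size correction described above amounts to dividing a peak count by the length of the annotated region. A minimal sketch follows; the function name and the toy counts are hypothetical, and only the per-megabase normalization is taken from the text.

```python
def peaks_per_mbp(n_peaks, region_bp):
    """Return peak density in peaks per megabase pair (Mbp)."""
    if region_bp <= 0:
        raise ValueError("region_bp must be positive")
    return n_peaks / (region_bp / 1_000_000)


# Hypothetical (number of peaks, annotated bases) pairs, not values from table S6.
toy_regions = {
    "toy_satellite": (3, 2_000_000),
    "toy_genome": (320, 10_000_000),
}
toy_densities = {name: peaks_per_mbp(n, bp) for name, (n, bp) in toy_regions.items()}
```

Normalizing by length is what makes a small repeat class comparable with the whole genome; raw peak counts alone would mostly reflect how much sequence each class covers.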

Contrary to the expectation of compact chro- 
matin and satellite DNA, we discovered enrich- 
ment of accessibility peaks in the SST1 satellite 
both inside the CenSat (41.4 peaks/Mbp) and 
in the chromosome arms (198.1 peaks/Mbp). 
Our peak annotations in HG002 were consistent 
with (44), which showed higher activity 
in CHM13 at noncentromeric arrays on 
chromosomes 4 and 19 compared with other SST1 
arrays (table S7). After the SST1, the satellite 
repeat with the second highest peak enrichment 
was the ACRO_Composite, a 7-kb repeat 
found across 12 chromosomes, including as 
tandemly arrayed sequences across the five 
acrocentrics with high sequence identity across 
composite units (44). The tandemly arrayed 
promoter elements in the ACRO_composite 
give rise to a periodic bimodal methylation 
structure across the array (Fig. 3C). This epi- 
genetic pattern has been proposed to be im- 
portant for both the efficient transcription of 
noncoding RNAs and the maintenance of the 
nearly perfect tandem arrays (46). The array 
has regions of increased CpG methylation that 
were associated with nanoNOMe peaks and 
transcription (CHM13 PRO-seq) (Fig. 3C). We 
quantified nanoNOMe peak densities across 
the ACRO_Composite between chromosomes 
and found that chromosome 21 has the highest 
(4.5 peaks/100 kb) and chromosomes 13 and 
15 have the lowest (0 peaks/100 kb) (Fig. 3D). 
The absence of nanoNOMe peaks in chro- 
mosomes 13 and 15 is correlated with low 
transcriptional activity (fig. S22). This high- 
resolution look within the acrocentric repeats 
indicates chromosome-specific activity of the 
ACRO_Composite across both CHM13 and 
HG002, suggesting a persistent functional 
role for the ACRO noncoding RNA through- 
out early- and late-stage development. 

By contrast, we also observed methylation 
periodicity in untranscribed satellite repeats 
such as the HSat2, and these regions were 
largely inaccessible as measured by nanoNOMe 
(Fig. 3E) (44). This periodicity in methylation 
corresponds to the underlying chromatin 
structure and echoes the genetic repeat size, 
suggesting the presence of functional genomic 
elements. Our initial epigenetic assessments 
of these assembled satellite sequences indicate 
a complicated regulatory structure stretching 
beyond the accepted notion that the repetitive 
fraction of mammalian genomes is entirely 
methylated and repressed by a highly con- 
densed chromatin state (47). 


Single-read-level analysis in satellite arrays 
reveals array heterogeneity 


Long reads, coupled to a complete reference 
assembly, confer the ability to explore meth- 
ylation patterns of single molecules, each of 
which represents the methylation pattern of a 
single allele from a single cell. The X chromo- 
some provides a unique opportunity to study 
these patterns because of the role of allele- 
specific methylation in X chromosome in- 
activation (XCI). Female somatic tissues have 
a mixture of paternal or maternal X expression 
because the same X chromosome is not always 
repressed; therefore, the active X (Xa) and 
inactive X (Xi) cannot be distinguished with 
heterozygous single-nucleotide polymorphisms 
alone. Examining methylation state at CGIs, we 
clustered reads on the CHM13 X chromosome 
as hyper- or hypomethylated (78). To determine 
whether the clusters represent the Xa and Xi, 
we first focused on genes known to be subject 
to XCI (XCI genes) or known to escape in- 
activation (escape genes) and compared our 
results with the clonal female lymphoblast cell 
line GM12878, in which the Xi is always the 
paternal allele (fig. S23A and B) (48). There, 
we found the Xa to have hypomethylated pro- 
moters and hypermethylated gene bodies com- 
pared with the Xi (49). However, in CHM13, we 
discovered that not all genes (e.g., TAF9B and 
PRKX) were properly regulated, with TAF9B 
escaping XCI and PRKX being subject to XCI, 
contrary to expectation. This is likely due 
to failure of X chromosome inactivation in 
androgenetic CHMs (fig. S23C and D and 
table S8) (50). 

Moving this analysis into repetitive regions, 
we analyzed DXZ4, a satellite that acts as a 
major epigenetic regulator of XCI (57). This 
165-kb macrosatellite repeat contains 3-kb 
monomeric units, each with a bidirectional 
CGI promoter and a CTCF site that is hypo- 
methylated on the Xi and hypermethylated 
on the Xa in healthy cells (52, 53). Single-read 
clustering revealed two distinct clusters of 
reads, one with higher methylation across 
the repeat and the other with lower methyla- 
tion (Fig. 3F). This analysis revealed an un- 
expected level of heterogeneity in methylation 
of monomers within the array. We hypothesize 
that this variation is a result of the aberrant 
XCI state of CHM13, because intra-array 
variation was not observed in the Xa at DXZ4 in 
HG002 (fig. S24). Observing epigenetic dif- 
ferences between monomers of satellite repeats 
could grant insights into human disease, pro- 
viding a detailed mechanistic understanding 
of satellite dysregulation. From this analysis, 
we demonstrate that we can cluster reads 
using methylation alone to identify heteroge- 
neous populations and intra-array epigenetic 
variation even in the absence of heterozygous 
genetic variants. 


Methylation maps of human centromeres reveal 
complex epigenetic patterns 


Human centromeres are composed of alpha 
satellite DNA, with an AT-rich, ~171-bp repeat 
unit or “monomer.” The largest arrays of alpha 
satellites in the human genome are further 
organized in chromosome-specific, higher- 
order repeats (HORs) or larger, multimono- 
meric repeat units (54). Centromeres can 
contain multiple distinct alpha satellite HOR 
arrays that can be classified into active and 
inactive HORs (55, 56). The HORs within 
active arrays have specialized epigenetic 
regulation that is important in establishing 
and maintaining centromere identity (56, 57). 

and within different satellite repeats (bottom). (C) Nanopore CpG methylation 
profiles, HG002 NanoNOMe accessibility peaks and Z-score (negative is 
inaccessible, positive is accessible), and non-k-mer-filtered (multimapping) 
PRO-Seq coverage at the ACRO_Composite repeat (chromosome 14: 121,193 
to 162,142). Annotation tracks at the bottom are the RepeatMasker V2 annotation 
from (44), monomeric annotations of the ACRO_Composites, and a GC density 
track. (D) Ideogram showing the arrayed locations of the ACRO_Composite 
across the acrocentric chromosomes (purple) within the acrocentric short 
arms (gray shaded). Listed above each chromosome is the nanoNOMe 
ACRO_composite peak density in peaks/100 kb. (E) Nanopore CpG methylation 
profiles and HG002 NanoNOMe accessibility Z-score of the HSat2 repeat 
(chromosome 16: 49,163,529 to 49,239,753). Annotation bars below represent 
CpG density and HSat2 repeat units on the bottom. (F) The DXZ4 locus on CHM13 
clustered into two haplotypes (low CGI methylation and high CGI methylation) 
based solely on promoter methylation state. Left: methylation frequency plot 
of each haplotype. Right: single reads from the gray highlighted region on the 
left, with boxes highlighting CGI cluster group-level epigenetic variability and 
intra-array-level epigenetic variability between neighboring monomeric units. 


Centromere protein A (CENP-A) is an H3 
variant that is enriched in centromeric nu- 
cleosomes and marks sites of kinetochore 
assembly (58). In HOR arrays, notable 
hypomethylation colocalizing with CENP-A 
enrichment at chromosomes X and 8 has been 
described (13, 14). We extended this finding 
to all CHM13 centromeres, referring to this 
hypomethylation as the centromeric dip 
region (CDR) (Fig. 4A and table S9). We found 
that CDRs were present only in active HORs 
(fig. S25), and that active HORs were larger 
in size and had higher mean methylation fre- 
quency than inactive HORs, as exemplified 
by the chromosome 5 centromere (Fig. 4B). 
These results underscore the importance of 
methylation in proper centromere regulation 
and kinetochore assembly. 

plot. Chromosomes 3 and 4 have an HSat1 repeat (blue highlight) that breaks up 
the live HOR array. (B) Left: CHM13 methylation in the centromeric region of 
chromosome 5. Smoothed methylation frequency is plotted in 10-kb bins. HOR 
arrays are annotated as blue (“active”) and pink (“inactive”). Right: scatter plot 
of average methylation within each HOR array versus size in megabase pairs. 

To investigate whether CDRs were confined 
only to early developmental samples, we examined 
HG002 nanopore-sequencing data to 
probe centromere methylation in an adult 
differentiated cell line. However, the high 
level of HOR array variability and the resulting 
inability to confidently phase and map 
reads from diploid chromosomes prevented 
us from using the T2T-CHM13 HOR reference 
for HG002 reads, as evidenced by the anomalous 
coverage that we observed for HG002 
alignments in the HOR arrays (fig. S26) (59). 
Instead, we took advantage of the haploid 
nature of the HG002 X chromosome and used 
an HG002-specific X centromere reference 
(4, 56), and were able to clearly observe a CDR 
(Fig. 4C). Furthermore, using nanoNOMe, the 
CDR was coordinated in this sample with a 
highly inaccessible region. When we examined 
the size of the inaccessible regions in the HOR 
versus the surrounding pericentromeric and 
centromeric transition (CT) regions, we found 
that the HORs were enriched in dinucleosomes 
compared with these other regions (fig. 
S27). Finally, looking at CUT&RUN CENP-A 
and centromere protein B (CENP-B) data, 
we observed a peak of CENP-A and CENP-B 
binding at the CDR. This is coordinated 
with a marked hypomethylation of the 
CENP-B motif within CDRs as opposed to 
outside the CDRs (fig. S28), and methylation 
is known to reduce CENP-B binding (60). This 
finding highlights the potential functional 
importance of the CDR for kinetochore 
formation. 

as raw read counts with input shaded gray. Bottom bar annotates satellite 
regions indicating the location of the HOR, MON, GSat, HSat4, and CT regions. 
(D) Methylation in the active HOR array across diverse individuals. Coriell cell line 
sample ID and cenhap group are annotated to left. HORs are annotated as 
red (younger) and gray (older) computed on the basis of sequence divergence. 

Taking this a step further, using Human 
Pangenome Reference Consortium (HPRC) 
data, we leveraged the assembled X chromo- 
somes of four additional diverse male sam- 
ples representing individuals included in the 
1000 Genomes Project (Fig. 4D) (56, 67). All 
arrays showed a distinct CDR in the X chro- 
mosome, with positional variability in the 
CDR location across individuals. Furthermore, 
CDR position was shared between individuals 
with more closely related centromere-spanning 
haplotype (cenhap) assignments. 

Cenhaps are long haplotypes that include 
centromere arrays due to reduced recom- 
bination in CenSat regions (56, 62). Three 
of the samples, CHM13 (European), HG002 
(European), and HG01109 (Puerto Rican), are 
within cenhap group 2, and all contain a cen- 
trally positioned CDR within an evolutionarily 
“younger” region of the HOR, as defined in (56). 
Two of the samples, HG01243 (Puerto Rican) 
and HG03492 (Pakistani), are within cenhap 
groups 3 and 1, which have been shown to be 
phylogenetically related (i.e., they share a clade 
with cenhaps 1 to 4) (56), and have a CDR 
positioned more toward the q-arm side of the 
centromere within the evolutionarily younger 
region of the HOR array. Finally, one of the 
samples, HG03098 (African), from the more 
distantly related cenhap group 9, has a CDR 
positioned toward the p-arm of the centro- 
mere in an older (more diverged) region of the 
HOR array (supporting the previous observa- 
tion of an epi-allele in the region using avail- 
able short-read datasets) (56). Therefore, we 
demonstrate the use of CDRs to identify epi- 
genetic variability within human centromeres, 
variations that may influence centromere func- 
tion during cell division. These variations show 
the critical importance of epigenetic profiling 
in the centromere, finding variation between 
individuals in a discrete, epigenetically defined 
region of the centromere. 


DISCUSSION 


This work provides a comprehensive view of 
the epigenetic organization of a complete 
human genome, uncovering complex epige- 
netic patterns in the previously unresolved 8% 
of the human genome. Functional annotation 
of these intractable regions has been overlooked 
not because of their lack of importance but 
rather because of technological limitations. 
Our study opens these regions to explore their 
epigenome, leaving no region of the genome 
unreachable. Here, with the combination of a 
complete genome assembly and technological 
advances in epigenetic profiling, we make 
substantial strides in functional genome 
assessment, expanding ENCODE (1) to include 3 to 
19% more peak calls and increasing the 
number of CpG methylation calls by 10%. Long-read 
epigenetic methods, here focusing on 
nanopore methylation and chromatin accessibility, 
can resolve single-molecule epigenetic patterns 
within these regions, providing a foundational 
assessment of these areas. Long-read methyl- 
omes of distinctive developmental time points 
surveyed >99% of CpGs, establishing the CHM13 
and HG002 methylomes as the most complete 
human methylomes to date (3). With these data- 
sets, we profiled the additional 225 Mbp of se- 
quence and 2680 gene annotations. 

Of the previously unresolved genes, we found 
57 with evidence of active promoters, including 
H3K4me3 or H3K27ac marks, in more than 
one cell type. We found 82 genes with a single 
cell type supporting active promoters, provid- 
ing evidence that these previously unresolved 
gene annotations are functionally active across 
tissues. With more data from different tissue 
types, we may identify even more functional 
genes. More generally, we found that evolution- 
arily older gene paralogs were epigenetically 
repressed (similar to the epigenetic silencing of 
transposons), conferring genome stability and 
thus influencing genome evolution (63, 64). 

Examining satellite DNA, we integrated 
short- and long-read datasets to investigate 
complete satellite arrays, revealing that these 
regions vary in epigenetic and transcriptional 
activity despite high sequence identity and 
highlighting the importance of the local chro- 
mosome environment as a modulator of epi- 
genetics. Repetitive DNA on the acrocentric 
short arms is known to play a role in nucleolar 
formation; however, the previous absence of 
these regions from the human reference has 
hampered research (65). Our findings suggest 
that, rather than acting in unison, the repeat 
families on these individual acrocentric chromo- 
somes all have their own epigenetic identity, 
likely contributing to their unique functional 
roles in genome integrity and organization. 

One of the features of our single-molecule 
epigenetic data is the ability to investigate 
single-molecule patterns of epigenetics. We 
used methylation alone to cluster reads in 
repetitive areas devoid of heterozygous 
polymorphisms. This includes the DXZ4 array, in 
which the methylation signature is critical to 
X chromosome inactivation (66, 67). With the 
increase in resolution, our results show meth- 
ylation variability between the clustered pop- 
ulations and intra-array epigenetic variation 
within adjacent monomers in the same array. 
Because satellite arrays are known to be hyper- 
variable in the human population and linked 
to several human diseases, these results high- 
light the importance of long-read single- 
molecule epigenetic studies for understanding 
disease pathology. 

Finally, the T2T-CHM13 genome assembly 
has opened exploration of the human cen- 
tromere, enabling us to probe the epigenetic 
elements that define centromeric chromatin. 
We extended our original discovery of the 
CDR in chromosome 8 and chromosome X to 
all chromosomes, and found that CDRs denote 
the position of centromere-associated proteins 
(CENP-A and CENP-B in the HG002 genome) 
in differentiated cells (HG002, a lymphoblast). 
This provides evidence of CDRs outside of early 
developmental CHMs and emphasizes their 
importance in kinetochore positioning and 
epigenetic regulation of chromosome segrega- 
tion. Expanding our CDR analysis to male X 
chromosomes representing diverse haplotypes, 
we uncovered variability in the localization of 
the CDR within the X HOR array. Such 
variability in active centromeric arrays has been 
explored through the presence of epi-alleles 
(68); however, we have been able to demon- 
strate the use of CDRs to precisely predict 
kinetochore site localization within an active 
array and report across individuals represent- 
ing diverse ancestry. When combined with 
findings in other organisms, e.g., maize (69) 
and medaka (70), this suggests that the CDR 
is a conserved, functionally important feature 
of complex centromeres across vertebrate and 
plant lineages. Proper kinetochore formation is 
an essential process for eukaryotic cell division, 
a process that occurs in humans 330 billion 
times per day to sustain life. Our results lead to 
two major conclusions about the CDR: (i) CDR 
location on a given array is fixed in early de- 
velopment and maintained upon differenti- 
ation and (ii) there is a single stable CDR in 
each centromere. Our initial profile provides a 
multitude of avenues for future research, in- 
cluding how CDR position influences meiotic 
and mitotic stability, disease, and aneuploidy. 

Our results act as a foundational study, ex- 
panding studies of the human genome through 
the use of the complete reference. There remain 
significant challenges to further exploring the 
epigenome in a larger and more diverse sam- 
ple set to achieve optimal sequence alignment, 
especially among structurally variable repeti- 
tive regions, e.g., HORs. Efforts by the HPRC 
(71) to generate fully phased diploid genome 
assemblies will enable population-scale explo- 
ration of these areas. Limitations of short-read 
sequencing in unique regions can be 
supplemented by long-read epigenetic methods 
currently under rapid development (16, 17). We are 
on the precipice of exploration into duplicated 
and repetitive portions of the genome. Further 
development of long-read epigenetic profiling 
across different populations and disease states 
will reveal more about regulation within the 
genome’s most elusive regions. 


METHODS SUMMARY 

Methylation processing 

Nanopore reads were obtained from (23, 14, 72). 
Ultra-long nanopore reads were aligned to 
the CHM13 reference (4) with Winnowmap 
version 2.0 (73) with a k-mer size of 15. BAM 
files were filtered for primary alignments with 
SAMtools (version 1.9); analysis of centromeric 
regions was done on reads >50 kb. To measure 
CpG methylation in nanopore data, we used 
Nanopolish (version 0.13.2) with a log-likelihood 
ratio (LLR) cutoff of -1.5/1.5 (84). HG002 bi- 
sulfite FASTQs were collected from the Oxford 
Nanopore Technologies (ONT) open data re- 
pository https://labs.epi2me.io/gm24385-5mce. 
Paired-end FASTQs were aligned with Bismark 
(version 0.22.2) (74). For Nanopolish to Mega- 
lodon comparisons, Megalodon was run with 
the r9.4.1_450bps 5mC model with threshold- 
ing set as default. 
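
The LLR cutoff described above can be sketched as follows. This is a minimal illustration, not the Nanopolish implementation: the function names are hypothetical, and treating a value exactly at the cutoff as a confident call is an assumption. Calls with |LLR| below the cutoff are treated as ambiguous and excluded from the frequency.

```python
def classify_call(llr, cutoff=1.5):
    """Classify one per-read, per-site log-likelihood ratio (LLR)."""
    if llr >= cutoff:
        return "methylated"
    if llr <= -cutoff:
        return "unmethylated"
    return "ambiguous"  # between -cutoff and +cutoff: not counted


def methylation_frequency(llrs, cutoff=1.5):
    """Fraction of confident calls at a site that are methylated.

    Returns None when no read passes the cutoff at this site.
    """
    calls = [classify_call(x, cutoff) for x in llrs]
    n_meth = calls.count("methylated")
    n_unmeth = calls.count("unmethylated")
    if n_meth + n_unmeth == 0:
        return None
    return n_meth / (n_meth + n_unmeth)


# Toy LLR values for a single CpG site (made up for illustration).
toy_site_llrs = [2.0, -3.0, 0.5, 1.5]
toy_frequency = methylation_frequency(toy_site_llrs)
```

In this toy case two reads are called methylated, one unmethylated, and one is discarded as ambiguous, so the site frequency is two thirds.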


NanoNOMe 


HG002 cells were grown in culture and treated 
according to methods outlined in (16). Purified 
genomic DNA was prepared for nanopore se- 
quencing following the protocol in the genomic 
sequencing by ligation kit LSK-SQK109 (ONT). 
To measure CpG and GpC methylation in nano- 
pore data, we used Nanopolish (version 0.13.2) 
on the nanonome branch https://github.com/jts/nanopolish/tree/nanonome (34). We set an 
LLR threshold of -1/1 for GpC methylation calls 
and -1.5/1.5 for CpG methylation calls. 
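
The accessibility tracks in the figure legends are described as 200-bp binned Z-scores of nanoNOMe GpC methylation frequency. A minimal sketch of that kind of transformation is given here under stated assumptions: per-site frequencies are averaged within each bin, and the Z-score is taken across the bin means using the population standard deviation. The function name and input layout are hypothetical, and the exact normalization used for the published tracks is not specified in this section.

```python
from collections import defaultdict
from statistics import mean, pstdev


def binned_zscores(sites, bin_size=200):
    """Bin per-site GpC methylation frequencies and Z-score the bin means.

    sites: iterable of (position, frequency) pairs, frequency in [0, 1].
    Returns a dict mapping bin start coordinate -> Z-score.
    """
    bins = defaultdict(list)
    for pos, freq in sites:
        bins[pos // bin_size].append(freq)
    bin_means = {b: mean(values) for b, values in bins.items()}
    all_means = list(bin_means.values())
    mu = mean(all_means)
    sd = pstdev(all_means)
    if sd == 0:
        raise ValueError("Z-score is undefined when all bins have the same mean")
    return {b * bin_size: (m - mu) / sd for b, m in sorted(bin_means.items())}


# Toy GpC sites (position, methylation frequency), made up for illustration.
toy_sites = [(10, 0.2), (150, 0.4), (210, 0.6), (450, 0.9)]
toy_z = binned_zscores(toy_sites)
```

With this convention, positive values mark bins that are more accessible than average and negative values mark less accessible bins, matching the sign convention given in the figure legends.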


Methylation clustering 


Methylation clustering was performed across 
the CHM13 X chromosome on all CGIs that 
overlap an annotated promoter of a protein- 
coding gene. Within the CGI, reads with an 
average methylation >0.2 were considered meth- 
ylated, and reads with an average methylation 
<0.2 were considered unmethylated. Reads 
were only considered if they spanned the 
entirety of the CG islands and were longer 
than 5 kb. Clustered reads were then inter- 
sected with known escape and XCI genes from 
(51). The same clustering procedure was per- 
formed at the DXZ4 locus. 
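
The read-level rules above can be sketched as follows. This is an illustrative reading of the text, not the code used in this study: the `Read` record, the half-open coordinate convention, and the handling of a read whose average equals the cutoff exactly (left unassigned, because the text does not specify it) are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Read:
    name: str
    start: int  # alignment start on the reference (0-based, inclusive)
    end: int    # alignment end on the reference (exclusive)
    cpg_calls: dict = field(default_factory=dict)  # position -> 1 (methylated) or 0


def cluster_reads_by_cgi(reads, cgi_start, cgi_end, min_len=5_000, cutoff=0.2):
    """Split reads into methylated and unmethylated groups at one CpG island."""
    clusters = {"methylated": [], "unmethylated": []}
    for read in reads:
        if read.end - read.start <= min_len:
            continue  # keep only reads longer than 5 kb
        if read.start > cgi_start or read.end < cgi_end:
            continue  # keep only reads spanning the entire island
        calls = [c for pos, c in read.cpg_calls.items() if cgi_start <= pos < cgi_end]
        if not calls:
            continue  # no confident calls inside the island
        average = sum(calls) / len(calls)
        if average > cutoff:
            clusters["methylated"].append(read.name)
        elif average < cutoff:
            clusters["unmethylated"].append(read.name)
    return clusters


# Toy reads, made up for illustration.
toy_reads = [
    Read("r1", 0, 10_000, {5_000: 1, 5_100: 1, 5_200: 0}),
    Read("r2", 0, 10_000, {5_000: 0, 5_100: 0}),
    Read("r3", 5_000, 12_000, {5_000: 1}),  # does not span the island
    Read("r4", 4_000, 7_000, {5_000: 1}),   # spans the island but is too short
]
toy_clusters = cluster_reads_by_cgi(toy_reads, cgi_start=4_900, cgi_end=5_300)
```

Because each long read derives from a single molecule, grouping reads this way separates molecules by methylation state without relying on heterozygous variants.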


CUT&RUN 


CUT&RUN was performed as detailed in (75) 
with some variations. For library preparation, 
NEBNext Ultra II End repair/A-tailing and 
ligation kits were used as indicated by the 
manufacturer, with 1.5 pg of Spike-in Yeast 
DNA added (obtained from the Henikoff labo- 
ratory at the Fred Hutchinson Cancer Research 
Center). Marker-assisted mapping of CUT&RUN 
data (CHM13 CENP-A, CHM13 H3K4me2, 
CHM13 H3K27me3, HG002 CENP-A, and 
HG002 CENP-B) to a sample-specific reference 
[CHM13 to T2T-CHM13 or HG002 to CHM13 
autosomes (chromosomes 1 to 22), HG002 T2T 
(chromosome X), and GRCh38 (chromosome Y)] 
was performed according to the methods out- 
lined in (56). 


ENCODE dynamic k-mer-assisted mapping 


We selected several ChIP-seq datasets generated 
as part of the ENCODE project (7), choosing 
those with at least 100-bp paired-end sequencing 
data and at least one matching input control. 
These criteria yielded 96 total sequencing 
libraries (table S9). Reads were mapped with 
Bowtie2 [version 2.4.1 (76)], alignments were 
filtered using SAMtools [version 1.10 (77)], and 
polymerase chain reaction duplicates were 
identified and removed with Picard tools [ver- 
sion 2.22.1 (http://broadinstitute.github.io/ 
picard)]. Alignments were then filtered for 
unique k-mers. Specifically, for each alignment, 
reference sequences aligned with template 
ends were compared with a database of k-mers 
unique in the whole genome. For each end 
of the paired-end sequencing reads, the k-mer 
length was determined by finding the largest 
multiple of 5 less than or equal to the aligned 
reference sequence length. Peak calls were 
made using MACS2 (version 2.2.7.1) (78) with 
default parameters and estimated genome sizes 
of 3.03 × 10^9 and 2.79 × 10^9 for chm13v1 and
GRCh38p13, respectively. 
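The k-mer length rule above is simple arithmetic, sketched here in Python. The function names, the use of a plain set as the unique k-mer database, and the choice to test the first k bases of each aligned end are assumptions made for illustration; the text specifies only that k is the largest multiple of 5 not exceeding the aligned reference length.

```python
from typing import Iterable, Set


def kmer_length(aligned_ref_len: int, step: int = 5) -> int:
    """Largest multiple of `step` less than or equal to the aligned length."""
    if aligned_ref_len < 0:
        raise ValueError("aligned length must be non-negative")
    return (aligned_ref_len // step) * step


def passes_unique_filter(end_sequences: Iterable[str], unique_kmers: Set[str]) -> bool:
    """Keep an alignment only if every template end matches a unique k-mer.

    Assumption: the first k bases of each aligned reference sequence are
    looked up in a set of genome-wide unique k-mers.
    """
    for seq in end_sequences:
        k = kmer_length(len(seq))
        if k == 0 or seq[:k] not in unique_kmers:
            return False
    return True
```

For example, an end aligned over 101 reference bases would be tested with a 100-mer, and one aligned over 99 bases with a 95-mer.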
PALEONTOLOGY


Body first 


Mammals have the largest ratio of brain to body size
(encephalization) among vertebrates. It has been
believed that this relationship emerged early on in
mammalian evolution, with enlarging brains leading the
way into new and diverse forms. However, Bertrand et
al. looked at encephalization rates across mammals beginning
in the Paleocene and found instead that body sizes were the
first to increase, allowing for niche filling after the extinction
of the dinosaurs (see the Perspective by Smith). It was only
later, in the Eocene, that brain size began to increase, likely
driven by a need for greater cognition in increasingly complex
environments. This led to the highly encephalized brains of
today, including those of humans. —SNV 


Science, abl5584, this issue p. 80; see also abo1985, p. 27 


An artist’s conception of an early mammal with a large body relative to its brain size


GAMMA-RAY ASTRONOMY 
Proton acceleration in a 
recurrent nova 


If a white dwarf strips material 
from a companion star, then 
enough hydrogen can build
up on the surface to trigger
a thermonuclear explosion,
ejecting material without 
destroying the white dwarf. 
This is observed as a nova. 
Some novae have been seen to 
emit high-energy gamma rays, 
but the origin of that emission 
has been unclear. The H.E.S.S. 
Collaboration observed the 
2021 outburst of RS Ophiuchi, 
a recurrent nova, determining 
its spectral and temporal evolution
at giga-electron volt and
tera-electron volt energies.
Modeling of the emission phys- 
ics shows that the expanding 
nova shock wave efficiently 
accelerated protons, providing 
a source of gamma rays and 
cosmic rays. —KTS 

Science, abn0567, this issue p. 77 


CHRONIC PAIN 
Pain, pain, go away... 


Damage to the nervous 
system pathologically alters 
the somatosensory system, 
which can result in neuro- 
pathic pain. Although pain 
development has been well 
studied, the mechanisms 
orchestrating pain recovery 
remain unclear. Working in
nerve-injured mice, Kohno et
al. identified a CD11c+ spinal
microglia population that
appears after pain development
(see the Perspective
by Sideris-Lampretsas and
Malcangio) and is essential 
for recovery from neuropathic 
pain. The ability of these cells 
to promote pain recovery 
relied on their high levels
of expression of insulin-like
growth factor-1. The CD11c+
microglia remained even after
pain recovery, and pain hyper- 
sensitivity returned if they 
were depleted. These find- 
ings elucidate mechanisms 
underlying remitting and 
relapsing neuropathic pain 
and could potentially help in
the development of therapeutic
strategies. —SMH 
Science, abf6805, this issue p. 80; 
see also abo5592, p. 33 


HUMAN EVOLUTION 
5000 years of Xinjiang
genetics

The Xinjiang region of China
is bordered by mountains
and represents an important
historical region. Sampling
ancient genomes, Kumar et
al. investigated the changes
in populations of this region
over time from the Bronze Age,
~5000 to 3000 years before
present (BP), covering the Iron
Age, ~3000 to 2000 years BP,
and into the Historical Era,
~2000 years BP. This analysis
found that older individuals 
represented ancestries from 
Steppe cultures, and that a later 
inflow of East and Central Asian 
ancestry entered the region 
around the end of the Bronze 
Age toward the beginning of the 
Iron Age. During the Historical
Era, mixing continued but
retained a core Steppe component
such that populations form
a genetic continuum. This retention
of genetic continuity in a
central population is surprising
because it represents patterns
more typically observed in isolated
populations. Furthermore,
these genetic links identify a
previously unknown lineage that
could potentially explain the
spread of the Indo-European
languages. —LMZ

Science, abk1534, this issue p. 62


Nanoneedle-based ocular 
drug delivery 


Traditional ocular drug delivery 
commonly requires topical 
administration, which has low 
delivery efficiency because of 
the complex eye environment. 
Large drug doses or frequent 
topical applications could lead 
to a high risk of side effects. 
Park et al. report a tear-soluble 
contact lens platform inte- 
grated with silicon nanoneedles 
for ocular drug delivery. The 
contact lens quickly dissolves 
in tear fluid and releases anti- 
inflammatory drugs, and the 
silicon nanoneedles can pen- 
etrate into the cornea painlessly 
and degrade gradually to realize 
long-term sustained therapeutic 
drug delivery. This platform was 
demonstrated in a rabbit model 
to treat a chronic ocular disease 
with reduced side effects 
compared with current standard 
approaches. —WG 
Sci. Adv. 10.1126/ 
sciadv.abn1772 (2022). 


Young lung inflammation 
Chorioamnionitis is a potentially 
fatal inflammatory condition of
the fetus caused by inflammation
of the chorion and/or
amnion that is linked to a large
fraction of preterm births and
an elevated risk for respiratory
disease in childhood and adulthood.
Toth et al. describe the
development of a prenatal rhe- 
sus macaque model that used 
intra-amniotic lipopolysaccha- 
ride (LPS) challenge to induce 
experimental chorioamnionitis. 
This model mirrored struc- 
tural and temporal changes 
consistent with prenatal 
human lung development. LPS 
challenge resulted in extensive 
damage to alveolar structures 
that resembled chronic lung 
disease in newborns, and 
single-cell RNA sequencing 
analysis showed dramatic 
disruption of alveologenesis 
signals. Intrauterine blockade 
of interleukin-1β and tumor
necrosis factor-α ameliorated
inflammatory responses and 
restored lung integrity. 
—CNF 

Sci. Transl. Med. 14,
eabl8574 (2022).


Pulling carbon dioxide
out of the air


A challenge in the design of 
polymeric membranes for gas 
separation is the trade-off 
between permeability, or how 
fast gases can flow through 
the membrane, and selectivity, 
the ability to separate one gas 
from another. In general, the 
more selective the membrane, 
the more slowly gases can 
flow through it. Sandru et 
al. overcame this trade-off 
through a layered design. They 
used a bottom layer of porous 
polyacrylonitrile that acts as a 
physical support for the middle 
layer of either elastomer-like
polydimethylsiloxane or glassy- 
type polytetrafluoroethylene. 
The authors then grafted a 
patchy layer of polyvinylamine, 
which selectively attracts car- 
bon dioxide, thus pulling it into 
the membrane and leading to 
much higher separation from 
nitrogen. —MSL 

Science, abj9351, this issue p. 90
and Jesse Smith


PLANT ECOLOGY 


Precision pollination 


Plants that rely on pollinators for
fertilization have evolved ways of 
making pollination more efficient. 
Stewart et al. tested whether flower 
symmetry, petal fusion, and stamen 
number affect the placement of pollen on 
pollinators’ bodies. They observed gener- 
alist and specialist pollinators (including 
native honeybees and nectar bats) visiting 
flowers in Thailand, and then measured 
pollen deposited on different areas of the 
body. Flowers with bilateral symmetry, 
fused petals, and fewer stamens depos- 
ited pollen on specific areas. Thus, the 
bilateral flower structure helped the precision
and efficiency of pollination across
pollinator taxa. —BEL 


New Phytol. 10.1111/nph.18050 (2022). 


The bat Eonycteris spelaea visiting bilaterally 
symmetrical Oroxylum indicum flowers, which 
have fused corollas, allowing precise deposition
of pollen on the bat’s head.


Baby steps toward 
protection 


Respiratory syncytial virus 
(RSV) is a major cause
of lower respiratory tract
infections in young children, 
presenting a particular danger 
to premature babies. The 
only specific intervention is 
palivizumab, an antibody that 
is given monthly to high-risk 
infants, but it only offers 
partial protection from RSV. 
Hammitt et al. report on the 
results of clinical testing of 
nirsevimab, another monoclo- 
nal antibody. Unfortunately, 
the authors did not compare 
nirsevimab directly with 
palivizumab, but its once per 
season dosing and efficacy 
compared with placebo make 
it a promising new contender 
for RSV prevention. —YN 


N. Engl. J. Med. 386, 837 (2022). 


Predicting ion mobility 
in solids 


There is considerable interest in 
detecting correlations between 
physicochemical properties of 
materials tailored for certain 
applications. Such correlations, 
if determined to be relevant
to the underlying functionality
and to be applicable to a broad 
range of materials, can become 
a major tool in computational 
screening for optimized practi- 
cal solutions. Using density 
functional theory calculations, 
Sotoudeh and Groß derived a
universal and reliable descrip- 
tor for ion mobility in crystalline 
solids, a critical performance 
parameter for electrochemical 
energy storage devices. The 
proposed descriptor exhibits 
high predictive power and is 
based on scaling relations 
between the materials’ features
that are readily available and
capture the key factors influencing
ion mobility in solids. —YS
JACS Au 2, 463 (2022). 


QUANTUM LOGIC 
Solving a puzzle with 
quantum logic 


Swiss mathematician Leonhard
Euler devised a sudoku-style 
mathematical puzzle in 1779 
that has been mathematically 
proven to be unsolvable. The 
puzzle consists of a 6 × 6 board
and 36 “army officers,” with
each officer belonging to one
of six different regiments (e.g.,
A, B, C...), and also having one
of six different ranks (e.g., 1, 2,
3...). Rather et al. created and
solved a “quantum” version
of this puzzle in which each
officer can assume a superposition
of multiple regiments
and ranks, e.g., partially A and
C and partially 2 and 4. The
quantum version of this puzzle
entangles the properties of
the 36 elements and provides
an elegant way to explore
the nascent field of quantum
combinatorics. It will also be 
interesting to find out if there 
is something special about the 
size of the 6 x 6 board. —YY 


Phys. Rev. Lett. 128, 080507 (2022). 


METAL NANOPARTICLES 
Surface ligands and 
electronic density 


Metal nanoparticles are 
normally stabilized by organic 
surface ligands, but how these 
ligands affect the electronic 
density of states of nanopar- 
ticles in solution has been 
difficult to determine experi- 
mentally. Litak et al. assessed 
changes in the Fermi level
(E_F) of alkanethiol-coated
gold nanoparticles with a
differential nuclear magnetic 
resonance using different 
solutions in coaxial sample 
tubes. Chemical shifts in 
hexane peaks in the outer 
tube that occurred when the 
nanoparticles were present 
relative to solvent in the inner 
tube measured the induced 
Pauli paramagnetic susceptibility
and in turn E_F. Magnetic
susceptibility was lower with 
longer alkyl chain ligands that 
donated more charge across 
the gold-sulfur bond and in 
turn pushed E_F higher into the
conduction band. —PDS 


ACS Nano 16, 4479 (2022). 


INFECTION 
Bacterial spanner 
in the works 


How the mammalian immune 
system coordinates responses
to all of the microbes that it
encounters simultaneously 
remains an open question. In 
mice, Biram et al. report that 
innate immune responses 
to some pathogens can 
disrupt ongoing adaptive 
immune responses to others. 
Infection with bacteria such 
as Salmonella Typhimurium 
or Listeria monocytogenes 
triggers the recruitment of 
Sca1+ monocytes to lymph
nodes, where their anti- 
microbial activity disrupts 
cell respiration. B cells in 
germinal centers, which are 
crucial for antibody produc- 
tion, then die off because their 
specialized metabolisms make 
them vulnerable. Bacterial 
lipopolysaccharide alone
was sufficient to trigger the 
production of monocytes and 
collapse germinal centers, 
indicating that the use of 
bacterial adjuvants in vaccines 
may in many cases do more 
harm than good. —STS 
Immunity 55, 442 (2022). 


HIV
Traveling with HIV
Why do test-and-treat HIV 
prevention trials have such 
disappointing effects on overall 
HIV prevalence? Regions in 
eastern and southern Africa 
still have very high levels of HIV 
infections despite significant 
public health efforts. Magosi 
et al. assessed test-and- 
treat control measures in 30 
communities in Botswana. 
Deep-sequencing and phylo- 
genetic analysis showed that 
most transmission occurred 
among people roughly the 
same age within a community 
or from an adjacent commu- 
nity. More HIV was imported 
into treatment communities 
than was exported. Therefore, 
for HIV, as for other pathogens, 
movement of infected individu- 
als can play a pivotal role in 
long-term persistence. The
work also shows that although 
focused public health efforts 
may prevail within a com- 
munity, these efforts may be 
rather easily undermined. —CA 
eLife 11, e72657 (2022).
SOLAR CELLS
Stabilizing inverted
solar cells


Although inverted (p-i-n) 
perovskite solar cells (PSCs) 
have advantages in fabrication 
and scaling compared with 
n-i-p cells, their power conver- 
sion efficiencies (PCEs) are 
usually lower. Azmi et al. show 
that by tailoring the number 
of octahedral inorganic sheets 
in two-dimensional perovskite 
(2DP) passivation layers for 
three-dimensional perovskite 
active layers, PCEs of more 
than 24% could be achieved 
(see the Perspective by Luther 
and Schelhas). The 2DP layers 
formed with oleylammonium 
iodide molecules at the electron- 
selective interface passivated 
trap states and suppressed ion 
migration. These PSCs retained 
more than 95% of their initial 
efficiency after 1000 hours of 
damp-heat testing (85°C and 
85% relative humidity), which 
passes a key industrial stability 
standard. —PDS 

Science, abm5784, this issue p. 73;
see also abo3368, p. 28


COGNITIVE SCIENCE 
Humans succeed 
through social learning 


Our capacity to accumulate 
complex algorithms over genera- 
tions allows human beings to 
adapt to diverse environments 
and solve challenges that go 
beyond our individual limitations. 
However, cultural accumulation
of innovative algorithms is
difficult to explain. Thompson
et al. studied a large number
of participants to explore the
evolution of algorithms under 
different learning conditions (see 
the Perspective by Henrich). 
Selective social learning that 
involved knowledge of the suc- 
cess level of different strategies 
or of different models preserved 
difficult-to-invent, efficient 
algorithms more than random 
social learning or one-attempt
asocial learning. Two efficient
algorithms were used by many 
people, but the most efficient 
one only spread under selective 
social learning. —PRS 
Science, abn0915, this issue p. 95; 
see also abo0713, p. 31


T CELLS
Checks and balances
Regulatory T cells (Tregs) are
critical for maintaining immune 
tolerance, but their suppressive 
functions must be temporarily 
overcome to generate produc- 
tive immune responses against 
pathogens and tumors. Teh et 
al. demonstrate that caspase-8 
regulates the homeostasis of 
Tregs by tuning their susceptibil- 
ity to inflammatory cell death. 
Caspase-8-deficient Tregs were
protected against apoptotic 
cell death during homeostasis 
but were more susceptible to 
necroptotic cell death during 
chronic viral infection. Human 
Tregs were also more sensitive to
necroptosis induced by caspase 
inhibition than conventional
T cells. Although caspase-8 in
Tregs functioned to prevent lethal
inflammation, it also antago- 
nized antiviral and antiparasitic 
immunity, highlighting a key 
regulatory mechanism in con- 
trolling the balance between 
immunity and pathologic inflam- 
mation. —CO 


Sci. Immunol. 7, eabn8041 (2022).


NEUROSCIENCE 
Palmitoylation for axonal 
degeneration 


In response to nerve injury, 
activation of the kinases DLK 
and JNK3 in axons induces 
retrograde signaling to the soma, 
where gene expression changes 
promote neuronal degeneration. 
Niu et al. found that palmi- 
toylation of both DLK and JNK3 
mediated their colocalization 
onto axon vesicles, where a 
positive-feedback phosphoryl- 
ation loop between the kinases 
maintained their activation state
as the vesicles trafficked to the
soma. Moreover, JNK3 palmi- 
toylation was essential for axonal 
retrograde signaling in response 
to optic nerve crush injury in 
mice. —LKF 

Sci. Signal. 15, eabh2674 (2022). 


CHEMISTRY 
Computationally designed 
pesticides 


Computational modeling of the 
properties of small-molecule 
pesticides promises to optimize 
performance while minimiz- 
ing ecotoxicity. Kostal et al. 
developed an in silico strategy 
to design safe pesticides for 
which parameters for aquatic 
toxicity and photodegradation 
rate are considered. The safest 
pesticide is less bioavailable 
and more readily degraded 
photochemically. Focusing 
on phenols and anilines and 
analyzing 700 known com- 
pounds, they concluded that 
phenol derivatives are the safest 
pesticides and propose further 
structural optimizations based 
on their computed properties. 
The authors make a compelling 
case that their computational 
approach should be used in the 
rational design of potent and 
safer pesticides. —SRC 
Sci. Adv. 10.1126/ 
sciadv.abn2058 (2022). 


ECOLOGY 
Genes to ecology 


In the past few decades, the 
identification of keystone spe- 
cies, that is, those with essential 
roles in structuring a community 
or ecosystem, has increased 
across systems. Barbour et al. 
extended this concept to genes, 
showing that a single allele of a 
particular plant defense gene 
facilitates species coexistence 
across a small experimental tro- 
phic system (see the Perspective 
by Nosil and Gompert). 
Specifically, plants with this 
allele grew faster, supporting 
larger populations of two species
of herbivores and their predators.
This finding suggests that
genotype variation can play a 
role in the structure and function 
of organismal systems. —SNV 
Science, abf2232, this issue p. 70; 
see also abo3575, p. 30
HUMAN EVOLUTION


Bronze and Iron Age population movements underlie
Xinjiang population history


Vikas Kumar, Wenjun Wang, Jie Zhang, Yongqiang Wang, Qiurong Ruan, Jianjun Yu,
Xiaohong Wu, Xingjun Hu, Xinhua Wu, Wu Guo, Bo Wang, Alipujiang Niyazi, Enguo Lv,
Zihua Tang, Peng Cao, Feng Liu, Qingyan Dai, Ruowei Yang, Xiaotian Feng, Wanjing Ping,
Lizhao Zhang, Ming Zhang, Weihong Hou, Yichen Liu, E. Andrew Bennett, Qiaomei Fu


The Xinjiang region in northwest China is a historically important geographical passage between East 
and West Eurasia. By sequencing 201 ancient genomes from 39 archaeological sites, we clarify the 
complex demographic history of this region. Bronze Age Xinjiang populations are characterized by four 
major ancestries related to Early Bronze Age cultures from the central and eastern Steppe, Central 
Asian, and Tarim Basin regions. Admixtures between Middle and Late Bronze Age Steppe cultures 
continued during the Late Bronze and Iron Ages, along with an inflow of East and Central Asian ancestry. 
Historical era populations show similar admixed and diverse ancestries as those of present-day Xinjiang 
populations. These results document the influence that East and West Eurasian populations have had
over time in the different regions of Xinjiang.


Xinjiang in northwest China has played a
central role in the exchange of material
culture, agriculture, and technology
between the West and East Eurasian
peoples (1-4). The Xinjiang region is
surrounded by the Altai Mountains to the 
north and the Kunlun and Pamir mountain 
ranges to the south. The Tianshan Mountains 
in the central west region separate Xinjiang 
into the Zungharian and Tarim Basins, con- 
sisting mostly of arid semidesert with habitable 
regions around the rivers providing fertile land 
for farming (Fig. 1) (2, 3, 5, 6). In the Bronze Age 
(BA), metallurgical technologies traveled through 
Xinjiang to East Asia, agriculturally important 
plants such as wheat and barley [~5000 before 
the present (B.P.)] traversed from the west 
through the Inner Asian Mountain Corridor 
(IAMC), and millet (~4000 B.P.) is presumed 
to have entered Xinjiang from the east through 
the Hexi Corridor (3-5, 7). An understanding of 
the origins of the BA inhabitants of Xinjiang 
is necessary to follow the changes that these 
cultural and technological transfers have had
on the population structure over the ensuing
millennia. 

Several hypotheses have been proposed re- 
garding the origins of the BA populations of 
Xinjiang, which have been linked to migrations 
from the influential cultures surrounding the 
region. To the northwest were the Steppe 
cultures of the Afanasievo, Chemurchek, Okunevo, 
and Botai; to the south was the Central Asian 
Oxus or Bactria-Margiana Archaeological Com- 
plex (BMAC) civilization; and to the east, the 
Siba culture was present around the Hexi Corridor
(2, 8-12). Within this context, archaeological
and mitochondrial studies have suggested
that the BA inhabitants and cultures of Xinjiang 
were not derived from any indigenous Neolithic 
substrate but rather from a mix of West and 
East Eurasian people (5, 13, 14), whereas BA 
burial traditions suggest links with both North 
Eurasian Steppe cultures and the Central Asian 
BMAC civilization (15). Linguistically, the pres- 
ence of the now-extinct Indo-European Tocharian 
language, attested to in ~5th to 10th century 
CE texts from the Tarim Basin, also raised 
questions regarding the origin and extent of 
Indo-European-speaking people in Xinjiang 
(7, 15). To date, one mitochondrial study of 
the BA Xiaohe site in the Tarim Basin found 
evidence of both Steppe- and central Siberian- 
derived haplogroups and only a later influence 
from the BMAC (3), whereas a broader mito- 
chondrial survey indicated that these results 
may not apply to the region as a whole (14). A
recent genomic analysis of the BA Tarim Basin 
mummies found evidence of a local ancestry 
derived in part from ancient North Eurasians 
(ANEs) (6). Thus, to gain a more complete 
picture of the BA settlement of Xinjiang, in-depth 
exploration using the more precise analysis of 


nuclear DNA across a geographically compre- 
hensive collection of BA sites will be needed. 

The Iron Age (IA) of East and West Asia is 
marked by extensive population movements, 
genetic admixtures, and cultural changes 
(1, 11, 14, 17). In Xinjiang, the first iron materials,
which date from before the first millennium
(~3150 B.P.), were linked to Steppe
nomads (5), such as the Sakas, or to the 
Scythians, a broad and important IA nomadic 
culture whose presence has been reported in 
many archaeological sites of the Yili River 
Valley in northwest Xinjiang and the southern 
Tarim Basin (5). Many nomadic confedera- 
tions, such as the Saka, Hun, Pazyryk, Xiongnu, 
and Tagar, arose in regions surrounding Xinjiang 
during the IA (1, 3). The people of these nomadic
cultures maintained a high diversity in IA 
Xinjiang (13, 14). Of these, the Sakas were the
descendants of Late Bronze Age (LBA) herders 
(such as the Andronovo, Srubnaya, and Sintashta) 
with additional ancestries derived from Lake 
Baikal (Shamanka_EBA) (EBA, Early Bronze 
Age) and BMAC populations (1, 17, 18). Sakas
have been associated with the Indo-Iranian 
Khotanese language, which was spoken in 
southern Xinjiang before spreading to other 
parts of the region (19). A previous genomic
study from a single IA site has suggested the 
presence of Steppe-related ancestry in IA Xinjiang 
individuals (20); however, the high level of ge- 
netic variability and mobility reported for the 
broader region highlights the necessity of a 
broader investigation for a comprehensive 
understanding of the IA Xinjiang populations. 
The periods after 2200 B.P. oversaw several 
noteworthy struggles for control of the region 
by neighboring powers, such as the Yuezhi, 
Xiongnu, Han, and Turks (1, 3, 17). Thus, Xinjiang
represents a key area for studying the past 
confluence and coexistence of populations 
with dynamic cultural, linguistic, and genetic 
backgrounds. To track the demographic changes 
of the Xinjiang population over the cultural 
and technological transitions during the past 
5000 years, we generated and analyzed genome- 
wide data from 201 archaeological samples, 
collected from 39 sites across Xinjiang and 
dating from the BA and IA to the historical 
era (HE). 


Results 


We collected archaeological samples from 39 
sites and generated genome-wide data from 
201 ancient individuals; 104 of these individ- 
uals were directly radiocarbon dated, whereas 
dates were inferred using archaeological infor- 
mation for the remaining 97 (Fig. 1 and table 
S1) (22). DNA was extracted from bones, and 
libraries were constructed using previously 
described protocols (22, 23). Samples that 
passed ancient DNA filters for DNA damage 
and low contamination rates (<5%) were used 
for analyses (table S1; details are given in the
supplementary materials) (21, 24). Genomic
libraries were enriched to target ~1.2 million
single-nucleotide polymorphisms (SNPs) for human
DNA (25), generating a mean coverage
ranging from 0.01x to 8.60x. Later, pseudo- 
haploid genotypes were called on the targeted 
SNPs, resulting in ~5000 to 1,141,000 SNPs 
(table S1) (21). Additionally, among the 201
individuals, 27 with adequate endogenous 
DNA content were shotgun sequenced to a 
coverage of 0.22x to 3.60x, resulting in ~232,000 
to 1,077,000 SNPs (table S1) (22). We sampled
across Xinjiang from north (n = 40), west (n =
105), south (n = 49), central (n = 3), east (n = 3),
and unknown subregions (n = 1). The sequenced 
individuals were from either the BA (Xinj_BA), 
~5000 to 3500 B.P.; the LBA (Xinj_LBA), ~3500 to 
3000 B.P.; the IA (Xinj_IA), ~3000 to 2000 B.P.; 
or the HE (Xinj_HE), <2000 B.P. (table S1) 
(21). After removing five individuals for high 
contamination, kinship testing identified 87 
pairs of related individuals among all samples 
(table S2) (21). For the related individuals, only
those with the higher number of SNPs (n = 43)

Fig. 1. Xinjiang sampling
locations along with their 
groupings used in this study. 
(A) Map showing the geographic 
sampling locations of the 
archaeological sites included in 
this study. (B) BA, IA, and HE 
time periods are depicted on a 
timeline in years before the 
present (BP). The number of 
individuals from each site 

and time period is given in 
parentheses. Abbreviations are 
defined in table S1 and the 
supplementary materials.

were used for further analysis, which resulted
in 152 unrelated individuals after discarding
two with <30,000 SNPs (table S1) (21). After
analyzing the newly sequenced individuals 
along with published ancient and present-day 
populations using principal components anal- 
ysis (PCA), ADMIXTURE (26), f-statistics (27), 
qpAdm (28), and DATES (17), we observed that
these individuals were highly admixed, and many 
contained unique ancestries. On the basis of 
these observations, we also regrouped individuals
from the 39 archaeological sites into 64
subgroups for the follow-up analyses, primarily
using PCA and ADMIXTURE to identify genet- 
ically homogeneous individuals, including those 
found across neighboring sites and similar time 
periods (table S1) (21).
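The sample-filtering steps described in this paragraph can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the `Sample` record, the function name, and the greedy way of resolving chains of relatives are assumptions, while the thresholds (contamination below 5%, at least 30,000 SNPs, and keeping the relative with more SNPs) follow the text.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple

MAX_CONTAMINATION = 0.05  # samples must have contamination below 5%
MIN_SNPS = 30_000         # individuals with fewer SNPs are discarded


@dataclass(frozen=True)
class Sample:
    """Hypothetical per-individual summary record."""
    name: str
    contamination: float  # estimated contamination as a fraction
    n_snps: int           # number of SNPs with a genotype call


def select_unrelated(samples: Iterable[Sample],
                     related_pairs: Iterable[Tuple[str, str]]) -> List[Sample]:
    """Apply contamination, kinship, and SNP-count filters in that order."""
    kept = {s.name: s for s in samples if s.contamination < MAX_CONTAMINATION}
    pairs = list(related_pairs)
    dropped = set()
    # Assumption: walk individuals from most to fewest SNPs, and let each
    # retained individual exclude its lower-coverage relatives.
    for s in sorted(kept.values(), key=lambda x: x.n_snps, reverse=True):
        if s.name in dropped:
            continue
        for a, b in pairs:
            if s.name == a and b in kept:
                dropped.add(b)
            elif s.name == b and a in kept:
                dropped.add(a)
    return [s for s in kept.values()
            if s.name not in dropped and s.n_snps >= MIN_SNPS]
```

The order of operations matters here: applying the SNP-count cutoff last mirrors the text, where two individuals were discarded after relatives had been pruned.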


BA Xinjiang shows a mix of eastern and 
western Steppe ancestry 


Attempts to explain the origins of BA Xinjiang 
populations have focused on similarities with 
surrounding Steppe and Central Asian populations.
Two major competing hypotheses
have been proposed to describe the first BA
settlers of the region. The Steppe hypothesis 
states that the Tarim Basin was settled by the 
Afanasievo Steppe culture to the north and 
finds support from similarities between cultural 
artifacts, burial customs, and skeletal charac- 
teristics (5, 13, 14). The Bactrian oasis hypoth- 
esis highlights the similarities in the desert 
basin environment and sustenance practices 
with the BMAC culture to the west of Xinjiang 
across the Pamir and Tianshan mountain 
ranges, connected through the IAMC (15).
Additional archaeological evidence also suggests
East Asian connections from the Gansu and 
Qinghai (GanQing) regions of northern China 
through the Hexi Corridor (3). We investigate 
the origins of BA Xinjiang populations through 
genome-wide data of 20 BA individuals from 
six sites merged with published BA individuals 
(6) and an additional seven LBA individuals 
(<3500 B.P.) from five sites, all in northern and 
western Xinjiang (Fig. 1 and table S1) (21). We
observe a high affinity between BA and LBA
Xinjiang populations and those with ANE ancestry

Fig. 2. PCA and ADMIXTURE analyses of Xinjiang populations. (A) Ancient
Xinjiang and other ancient populations are shown as different colors and shapes. 
Present-day populations are shown as gray circles, and only major groups are 
included. Most of the ancient Xinjiang populations lie on the cline extending from 
European and Siberian to East Asian populations. The published populations 

of Dzungaria_EBA, Tarim_EMBA, and Shirenzigou_IA are depicted in various black 
shapes. IA north (IA_N), south (IA_S), west (IA_W), and east (IA_E) are the
geographical locations of the IA individuals. (B) ADMIXTURE analysis of all the
newly reported ancient individuals at K = 7. The four major components are 
maximized in the following populations: ANE (green), Iranian farmer (red), 
Anatolian farmer (violet), and East Asian hunter gatherer (yellow). The other 
three are maximized in Han (orange), Mixe (cyan), and Papuan (dark blue). 
Supplementary figures of PCA and ADMIXTURE show all the present-day and 
ancient populations (21).

in both PCA and outgroup f3 analysis, as
represented by the 24,000-year-old Mal’ta 
and the 17,000-year-old Afontovogora3 (AG3) 
individuals, whose ancestry is also present in 
the Steppe Early to Middle Bronze Age (EMBA) 
(namely Yamnaya and Afanasievo) and western 
Siberian hunter gatherers (WSHGs) (Fig. 2A and 
figs. S1, S2, S4, and S9) (21).
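For readers unfamiliar with the outgroup f3 statistic used for these affinity comparisons, its basic form can be sketched as below. This is a simplified illustration only: it averages (o - a)(o - b) over SNPs from allele frequencies and omits the sample-size corrections and standard-error estimation that dedicated software applies. The function name and the use of `None` for missing data are assumptions.

```python
from typing import Optional, Sequence


def outgroup_f3(outgroup: Sequence[Optional[float]],
                pop_a: Sequence[Optional[float]],
                pop_b: Sequence[Optional[float]]) -> float:
    """Naive f3(Outgroup; A, B) from per-SNP allele frequencies.

    Larger values indicate more genetic drift shared by A and B since
    their split from the outgroup. SNPs with missing data (None) in any
    population are skipped.
    """
    products = [
        (o - a) * (o - b)
        for o, a, b in zip(outgroup, pop_a, pop_b)
        if o is not None and a is not None and b is not None
    ]
    if not products:
        raise ValueError("no SNPs with data in all three populations")
    return sum(products) / len(products)
```

Under this definition, a test population that shares more ancestry with a reference (for example, one carrying ANE-related ancestry) yields a higher value than one that shares less.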

In addition to the ANE ancestry, we fur- 
ther identified three other major ancestry 
components—East Asian hunter gatherers 
and Iranian and Anatolian farmers—using 
ADMIXTURE (K = 7, where K indicates the 
number of subpopulations making up the 
total population) models of BA Xinjiang in- 
dividuals (Fig. 2B and figs. S4, S7, and S8) 
(21). To quantify the proportions of different 
ancestral sources, we used a qpAdm model- 
ing analysis (28). Similar to the admixture 
results, the BA Xinjiang populations are com- 
posed of four major sources related to ANEs 
(27 to 91%), Anatolian (8 to 25%) and Iranian 
farmers (14 to 26%), and East Asians (9 to 
73%) (fig. S29 and tables S3 and S4) (27). 
Additionally, models with more contempo- 
raneous potential ancestral sources portray 
BA populations from northern and western 
Xinjiang as a mixture of four major ancestry 
sources derived from ancestry present in the 
BA Tarim Basin mummies at the Xiaohe site 
Xinj_BA1_TMBA1 (8 to 85%) (16), Afanasievo
(57 to 100%), Shamanka_EBA (10 to 92%),
and Gonur1_BA (BMAC) (~43%) (Fig. 3A and
tables S5 and S6) (21). The genetic connection
with Steppe EMBA populations is also validated
by 10 individuals, 67% carrying Afanasievo-associated
R1b1 Y-haplogroups (table S1 and
fig. S26) (7, 21).
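
As a rough illustration of what K means in the ADMIXTURE-style model mentioned above, the sketch below treats one individual's expected allele frequency at each SNP as a weighted mix of K ancestral-component frequencies. The function name, the toy proportions, and the toy frequencies are all hypothetical and are not taken from this study; this is a sketch of the general mixture model, not the ADMIXTURE software or the authors' pipeline.

```python
from typing import List, Sequence


def expected_allele_freqs(
    q: Sequence[float], f: Sequence[Sequence[float]]
) -> List[float]:
    """Mix K ancestral allele-frequency vectors using ancestry proportions q.

    q[k]    -- hypothetical share of ancestry from component k (must sum to 1)
    f[k][j] -- hypothetical allele frequency of SNP j in component k
    """
    if not q or len(q) != len(f):
        raise ValueError("q and f must describe the same, non-zero number of components")
    if abs(sum(q) - 1.0) > 1e-9:
        raise ValueError("ancestry proportions must sum to 1")
    n_snps = len(f[0])
    if any(len(row) != n_snps for row in f):
        raise ValueError("every component needs one frequency per SNP")
    # Expected frequency at SNP j is the proportion-weighted sum over components.
    return [sum(q[k] * f[k][j] for k in range(len(q))) for j in range(n_snps)]


# Toy example with K = 3 components and 2 SNPs (made-up numbers).
toy_q = [0.5, 0.25, 0.25]
toy_f = [[0.8, 0.2], [0.4, 0.4], [0.0, 1.0]]
print(expected_allele_freqs(toy_q, toy_f))
```

Fitting software estimates q and f jointly from genotype data; the sketch only shows the forward direction, from proportions to expected frequencies.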

The presence of deep ANE ancestry ob- 
served in BA Xinjiang populations (Fig. 3A, 
table S5, and fig. S14) (27) may be traced to 
Tarim_EMBA1 ancestry, ~72% of which could 
be derived from the Upper Palaeolithic indi- 
vidual AG3 (16). ANE ancestry in present-day 
populations from South America (e.g., Surui 
and Karitiana), Europe, and Siberia may ex- 
plain the affinity for BA Xinjiang with these 
populations, as shown in the PCA and outgroup-f3
results (figs. S3, S10, and S13) (14, 27). Although
most of the BA Xinjiang populations showed 
a high affinity with each other—demonstrating 
a degree of homogeneity in the region at this 
time, as shown using outgroup-f3 statistics
(fig. S12) (21)—several individuals exhibited 
more diverse ancestries. These individuals 
acted as outliers in our dataset, and they were 
likely to have represented less-common ances- 
tries in the sampled region or a high degree of 
mobility of individuals or smaller groups. 

One such individual from the Songshugou 
site in northern Xinjiang (Xinj_BA7_oEA), radiocarbon
dated to 5043 to 4861 B.P., revealed an
East Asian affinity in PCA and ADMIXTURE,
supported by f4-statistics (Fig. 2 and figs. S4
and S16) (22). In qpAdm models, Xinj_BA7_oEA
could be modeled as a two-source admixture of
Xinj_BA1_TMBA1 (~8%) and Shamanka_EBA
(~92%) (Fig. 3A and tables S5 and S6) (21), but it
cannot be modeled using the Neolithic Mongo- 
lian ancestry (Mongolia_N_East) present in the 
Chemurchek_northAltai individuals (table S3) 
(7, 21). Estimating the timing of the admixture 
event using DATES, which calculates the timing 
of admixture events, resulted in dates between 
~6754 and 6041 B.P. (fig. S15) (22), which may 
place the Northeast Asian ancestry prevalent in 
the Baikal region in Xinjiang before the arrival 
of Steppe ancestry. Additionally, we identified 
an Afanasievo-related individual in the western 
Xinjiang Yili River Valley (Xinj_BA5_oSte) that 
overlaps directly with Steppe_EMBA Afanasievo 
and Yamnaya populations in the PCA (Fig. 2A 
and figs. S1, S2, and S4) (27) and shares more 
genetic drift with present-day European pop- 
ulations than with present-day Siberian pop- 
ulations (fig. S13) (27). In fact, we show this 
Steppe-related ancestry to be derived from 
Afanasievo and not from other western Steppe 
populations, as observed in f3-affinity tests
(fig. S11) and f4-statistics (fig. S17) (21). When
we estimated the admixture proportions with 
qpAdm, we observed that Xinj_BA5_oSte could 
be modeled as having an unadmixed Afanasievo 
ancestry (Fig. 3A and tables S5 and S6) (27). In 
addition to the outliers with East Asian and 
Steppe EMBA affinities, two individuals from 
the Chemurchek Culture of Chananguole (Xinj_ 
BA6_aBMAC), radiocarbon dated to 4352 to 
4096 B.P., showed high affinity with BMAC 
populations in PCA and ADMIXTURE (figs. S1 
and S4) as well as in a qpAdm model requiring 
~43% BMAC ancestry (tables S5 and S6) (27). 
This BMAC ancestry content is similar to that 
reported for two contemporaneous Chemurchek-associated
individuals at the nearby site of
Yagshiin Huduu in modern-day Mongolia (17),
which demonstrates a continuity with southern
Altai Chemurchek populations and establishes
BMAC ancestry in BA Xinjiang. DATES analy- 
sis estimates the Botai and BMAC admixture to 
have occurred ~5281 to 4575 B.P. The mean 
dates for the Afanasievo-related admixture of 
the BA Xinjiang samples are ~4877 to 4642 B.P. 
(fig. S15). This estimated timing of admixture 
would place Steppe-related people in northern 
and western Xinjiang at a similar time as when 
the Afanasievo and Chemurchek cultures were
developing near the Altai region. These results 
would be in line with the linguistic phylogenetic 
Tocharian split time of ~5000 B.P. (29) and may 
support the arrival of the Indo-European Proto- 
Tocharian language in Xinjiang earlier than the 
appearance of Indo-European languages in 
Europe (28).
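
To make the step from an admixture-timing estimate to a calendar date concrete, the sketch below assumes the estimate is expressed in generations before the sampled individual lived, and converts it to years B.P. by adding the individual's own age. The 29-year generation time, the function names, and the example numbers are assumptions chosen for illustration; they are not values reported in this study.

```python
from typing import Tuple


def admixture_date_bp(
    sample_age_bp: float,
    generations: float,
    years_per_generation: float = 29.0,  # assumed generation time, not from the text
) -> float:
    """Convert generations since admixture into a calendar date in years B.P."""
    if generations < 0 or years_per_generation <= 0:
        raise ValueError("generations must be >= 0 and generation time > 0")
    return sample_age_bp + generations * years_per_generation


def admixture_interval_bp(
    sample_age_bp: float,
    generations: float,
    std_err: float,
    years_per_generation: float = 29.0,
) -> Tuple[float, float]:
    """Return (older, younger) bounds for the estimate +/- one standard error."""
    older = admixture_date_bp(
        sample_age_bp, generations + std_err, years_per_generation
    )
    younger = admixture_date_bp(
        sample_age_bp, max(generations - std_err, 0.0), years_per_generation
    )
    return older, younger


# Hypothetical individual dated to 4500 B.P., admixed 10 +/- 2 generations earlier.
print(admixture_date_bp(4500, 10))
print(admixture_interval_bp(4500, 10, 2))
```

Because the result scales with the assumed generation time, a different assumption shifts the whole interval, which is one reason such dates are usually reported as ranges.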

After the identification of the four major 
ancestry components of BA Xinjiang popula- 
tions (ANE, East Asian hunter gatherer, and 
Iranian and Anatolian farmer), we tracked the
proportional changes of these ancestries over
time. Steppe Middle to Late Bronze Age (MLBA) 
ancestry is characterized by the Andronovo and 
Sintashta pastoral cultures, which reached the 
eastern Steppe by at least ~3000 B.P., and ar- 
chaeological evidence has also suggested its 
presence in Xinjiang (3, 30). The Andronovo- 
and Sintashta-related ancestries differ from 
that of EBA herders by an increase in Anatolian 
farmer-related components coming from west- 
ern Eurasia (7). The appearance of these an- 
cestries in Xinjiang at this time would place 
this region within the MLBA expansion of 
these Steppe populations. According to PCA, 
ADMIXTURE, and f3- and f4-statistical results
(Fig. 2 and figs. S18 and S19) (27), the seven 
LBA individuals from northern and western 
Xinjiang showed an increase of the Anatolian 
as well as Iranian farmer-related ancestries, 
similar to the Steppe_MLBA populations (17). 
These seven individuals are separated into four 
subgroups (table S1) (27) according to their 
genetic affinities. A high degree of Steppe_ 
MLBA (81 to 100%) is further observed in the 
working qpAdm models, with the rest of the 
ancestry being derived from Xinj_BA7_oEA 
or Shamanka-related people (7 to 12%) and 
BMAC (~12%) (Fig. 3A and table S7) (22). Additionally,
a single Xinj_LBA3 population from
west Xinjiang could be modeled using Steppe_EMBA
Afanasievo (~88%) and East Asian
ancestry (~12%) (Fig. 3A and tables S7 and
S8) (21), which suggests continuity between
BA and MLBA populations, despite the influx 
of Steppe_MLBA ancestry. Notably, one indi- 
vidual from the west Yili Jirentaigoukou site, 
Xinj_LBA4, worked as a single-source qpAdm 
model with Sintashta, indicating greater af- 
finity with Sintashta than Andronovo and the 
possible presence of both Andronovo and 
Sintashta ancestries in Xinjiang (tables S7
and S8) (21). 

These results, based on a broad sampling 
across the region, point to BA populations in the 
Tarim Basin containing a deep ANE-related 
ancestry (16), with Northeast Asian ancestry
similar to that from the EBA Lake Baikal re- 
gion appearing at least in the north and west. 
We also report additional movements and 
admixtures along the IAMC with Afanasievo- 
and BMAC-related populations, validating as- 
pects of both the Steppe and Bactrian oasis 
hypotheses for the settlement of Xinjiang. In 
the LBA, we find a continuation of existing 
genetic profiles with an additional influx of 
Steppe_MLBA ancestry. 


IA admixture with nomadic Steppe cultures 


The transition to the IA, which took place 
early in the first millennium BCE, is marked by 
the establishment of different nomadic groups 
around Xinjiang. This period also witnessed 
increased mobility leading to a greater connectivity
between East and West Eurasian
populations (5, 14, 31), which might have been
facilitated by an increasingly widespread use
of horses as well as the existence of several
natural passes connecting Xinjiang to neighboring
regions. The IAMC and Pamir region to
the west of Xinjiang have been proposed as a
BA route connecting Central Asia with the
Tarim Basin in Xinjiang (15). Characterizing
the extent of Central Asian BMAC-related ancestry
in Xinjiang over time would provide
information about the development of the
social and cultural connections between these
regions. We explored the impact of these proposed
interactions on IA Xinjiang populations
by analyzing 98 IA individuals throughout
Xinjiang (Fig. 1 and table S1) (27).

Fig. 3. Inferred qpAdm models and summary of ancient Xinjiang with
population movements. (A) The proximal qpAdm admixture proportions for
all Xinjiang populations. Each bar represents admixture proportion of the
listed subgroups for BA, LBA, IA, and HE populations. Subgroup details are
provided in table S1 and the supplementary text, and qpAdm modeling
results are provided in detail in tables S5 to S13 (21). The Swat Valley
Protohistoric Grave Type IA populations are in the SPGT group, and the
Yellow River basin Middle Neolithic population is in the YR_MN group.
(B) Inferred scenarios of admixtures in BA, LBA, IA, and HE Xinjiang with
possible population movements shown as arrows. Xinjiang BA populations can
be mostly characterized with Steppe_EMBA and Xinj_BA1_TMBA1 (Tarim
Basin EMBA) ancestries with additional ancestries of Central Asia (BMAC), as
observed in Chemurchek culture (Steppe_EMBA), and Northeast Asia
(Shamanka). MLBA Xinjiang populations contain additional Andronovo Steppe,
Central Asian (BMAC), and East Asian ancestries, whereas the IA and HE
populations show the major Xinjiang and Steppe MLBA ancestries with
additional components from BMAC and East Asian (EA) sources, shown as a
pie chart summarizing the qpAdm modeling of IA and HE populations using
the Xinj_LBA population (table S13 and fig. S28) (21). Where possible,
coloring corresponds to ancestry in (A).

We initially examined the relationships between
Xinj_IA and Xinj_BA populations. In
the PCA analyses, Xinj_IA individuals are not
well-differentiated geographically, and they
group closely with the Xinj_BA cluster lying
along a west-to-east cline and surrounded by
Steppe_MLBA and Steppe_EMBA population
clusters (Fig. 2A and figs. S1, S2, and S5) (22).
The IA Xinjiang populations show a coexistence
of diverse ancestries in the region, having affinities
for Steppe, East Asian, or Central Asian
populations, and are classified into 37 subgroups
(Fig. 2A, fig. S5, and table S1) (27). Building
on trends established during the MLBA, IA 
Xinjiang populations can be additionally char- 
acterized by increasing proportions of Anatolian 
and Iranian farmer-related ancestries with an 
additional increase of East Asian-related ances- 
try evident in ADMIXTURE and qpAdm results 
(Fig. 2A; figs. S5, S8, and S27; and tables S9 
and S10) (21). In outgroup-f3 tests, most of
these individuals shared the most alleles with
ancient Steppe-related populations, present-day
Europeans, and Siberians (figs. S9, S10, S21, and
S22) (27). However, several individuals showed
an increased East Asian affinity in ADMIXTURE
and f3 statistics, supporting a substantial influx
during the IA of East Asian ancestry into
Xinjiang (figs. S5, S9, S10, S21, and S22) (27).
To better define the genetic relationships 
between LBA and IA populations, we used 
Xinj_BA and Xinj_LBA as proximal sources 
in qpAdm models and observed that IA Xinjiang 
individuals could be modeled using Xinj_LBA1 
populations with additional ancestry from Xinj_BA1_TMBA1,
East Asian, or BMAC and Indus
periphery sources, demonstrating a genetic 
continuity of Xinj_LBA ancestry into the IA, 
with additional genetic contributions from 
East and Central Asia (fig. S28 and table S13) 
(21). As in Xinj_LBA, the predominant ances- 
try observed in the IA is derived from three 
sources: Steppe MLBA (~55%), BMAC (~18%), 
and Shamanka EBA (~27%) represented by 
Xinj_IA1 (43 individuals from 18 archaeolog- 
ical sites) (Fig. 3A, fig. S5, and table S9). Other 
subgroups with two or more individuals ex- 
hibit an affinity (denoted by “_a”) for either 
East Asian (32 to 60%) or Steppe MLBA (24 to 
59%) populations, whereas in some working 
models, BMAC ancestry is replaced by Xinj_BA1_TMBA1
ancestry (13 to 26%) (Fig. 3A; fig.
S28; and tables S9, S10, and S13) (27). Additionally,
21 outlier individuals (denoted by
“_o”) with unique ancestries as shown in PCA
and ADMIXTURE plots either show high af- 
finity with one or more of the three basic sources 
listed above or contain unique ancestry repre- 
sented by Steppe EMBA or Saka populations, as 
shown in qpAdm models (Fig. 3A; fig. S28; and 
tables S9, S10, and S13) (27). 

Because we have identified an influx of 
Steppe_MLBA sources into Xinj_LBA popula- 
tions, we further studied the impacts of this 
core Steppe ancestry on Xinj_IA populations. 
To differentiate between Steppe_EMBA and 
Steppe_MLBA affinity of IA Xinjiang popula- 
tions, we used the known absence of Anatolian 
farmer-like ancestry in the EMBA Steppe populations.
We performed this test using outgroup-f3
statistics of the form f3(Anatolia_Neolithic, X;
Mbuti) and f3(Kostenki14, X; Mbuti), taking
advantage of the lack of Anatolian farmer ancestry
in the Upper Paleolithic Kostenki14 individual
from Europe. The results supported a
greater affinity of Xinj_IA to Anatolian farmers, 
linking the more recent Steppe_MLBA ancestry 
to the greater part of the Xinj_IA populations 
(fig. S18) (27). We also observed a cline of in- 
creasing Anatolian and Iranian farmer-related 
ancestry accompanied by a decrease in WSHG 
affinity in Xinj_IA compared with the Xinj_BA 
population using an f4-statistics-based approach
(fig. S19) (21). The qpAdm models also validated
an increase in the Anatolian farmer ancestry 
component reaching up to ~43% in some of the 
Xinjiang populations and an Iranian farmer 
component of ~38% (fig. S27 and tables S3 and
S4) (21). The Steppe_MLBA ancestry found in
the IA is further supported by the presence of
the distinct Steppe_MLBA Y-haplogroup, R1a1
(n = 30), and greater affinity in outgroup-f3
analysis to Steppe_MLBA than to BA populations
(figs. S9, S21, and S28) (21). We identified
seven IA populations with Afanasievo ancestry
in qpAdm models and three individuals overall
carrying the Steppe_EMBA (R1b1) haplogroups
(tables S1, S9, and S10). We also found models 
with Afanasievo and Andronovo ancestry in 
four IA populations (ABST_IA2 oIAl, Xinj_ 
IA8_aEA, Xinj_IA10_aEA, and Xinj_IA12_ 
aSte) with additional Shamanka_EBA ances- 
try (P > 0.01). The DATES results suggested 
that the admixture event in Xinj_IA10_aEA 
occurred between ~4351 and 3177 B.P., a broad 
range that will require sampling additional 
IA individuals to refine (fig. S15) (21). This
continuity of core Steppe_EMBA ancestry may 
lend support to the persistence of BA Indo- 
European languages into IA Xinjiang (7). 
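
The outgroup-f3 comparison described above can be sketched from per-SNP allele frequencies. In its simplest form, f3(A, B; O) is the average over SNPs of (a - o)(b - o), so larger values mean that A and B share more drift relative to the outgroup O. The sketch below uses this uncorrected form (it omits the finite-sample correction that dedicated software applies), and the function name and all frequencies are made up for illustration rather than drawn from the study's data.

```python
from typing import Sequence


def outgroup_f3(
    a: Sequence[float], b: Sequence[float], outgroup: Sequence[float]
) -> float:
    """Uncorrected outgroup-f3: mean over SNPs of (a - o) * (b - o)."""
    if len(a) == 0 or not (len(a) == len(b) == len(outgroup)):
        raise ValueError("all inputs need the same, non-zero number of SNPs")
    total = sum((ai - oi) * (bi - oi) for ai, bi, oi in zip(a, b, outgroup))
    return total / len(a)


# Hypothetical allele frequencies at three SNPs.
test_pop = [0.9, 0.1, 0.7]   # stands in for X
ref_one = [0.8, 0.2, 0.6]    # stands in for a farmer-related reference
ref_two = [0.6, 0.4, 0.5]    # stands in for a second reference
outgroup = [0.5, 0.5, 0.5]   # stands in for the outgroup

# The reference with the larger value shares more drift with the test population.
print(outgroup_f3(ref_one, test_pop, outgroup))
print(outgroup_f3(ref_two, test_pop, outgroup))
```

Comparing the two values for the same test population mirrors the logic of contrasting f3(Anatolia_Neolithic, X; Mbuti) with f3(Kostenki14, X; Mbuti).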

The IA also shows an increase in the frequency
of BMAC ancestry in the f4-statistics
comparisons (figs. S19 and S20) (27). Seven IA 
populations were found to contain BMAC an- 
cestry (30 to 47%), and we observed four IA 
populations that could also be modeled using 
Indus periphery ancestry sources SPGT and 
two with Gonur2_BA (18 to 37%) (Fig. 3A and
tables S9 and S10). The increase in the appear- 
ance of BMAC ancestry suggests a substantial 
movement of people from either BMAC- or 
Indus periphery-derived populations into the 
Xinjiang region during the IA (Fig. 3A), most 
likely through the IAMC route over the Pamir 
and Tianshan Mountains. Notably, we found 
nine IA individuals to be genetically similar 
to previously identified Saka populations, and 
they could be modeled as a single source with 
TianShan and Central Sakas using qpAdm 
(Fig. 3A and tables S9 and S10) (27). Together 
with the results discussed above, we show an 
increase of external ancestries appearing in 
Xinjiang during the IA originating from mul- 
tiple Central, South, and East Asian sources, 
demonstrating a broadening of demographic 
contacts and exchanges between groups from 
the surrounding areas, with IA nomads, such as 
the Sakas and the Xiongnu, playing a substan- 
tial role. Further, although the spread of lan- 
guages is not always congruent with population 
histories (32), the presence of Saka ancestry 
in Xinj_IA populations supports an IA intro- 
duction of the Indo-Iranian Khotanese language, 
which was spoken by the Saka and later attested 
to in this region (19). 


East Asian ancestry increased from the BA to the IA


Archaeological discoveries of similar copper 
and bronze objects in the neighboring regions 
of Xinjiang and Gansu-Qinghai suggest an important
link with East Asia and the likely introduction
of metallurgy into northwest China
(33, 34). Along with such objects, the presence 
and extent of East Asian ancestry in the region 
is key to defining the movement and contacts 
between BA and IA Xinjiang populations with 
different parts of East Asia. Northeast Asian 
ancestry from the Siberian Lake Baikal region 
(Shamanka) is present in BA Xinjiang pop- 
ulations, where, apart from the major Steppe 
ancestry, some admixture models also require 
Northeast Asian sources (Fig. 3A, fig. S27, and 
table S3) (21). In Songshugou in northern
Xinjiang, ~92% of the genome of one BA in- 
dividual (Xinj_BA7_oEA) could be sourced 
from Shamanka_EBA (Fig. 3A and table S5) 
(21). During the IA, an increase in the East 
Asian ancestry component is frequently ob- 
served among Xinjiang samples, and the dis- 
tribution of Xinj_IA individuals on the PCA 
follows a west-to-east cline of increasing East 
Asian ancestry (Fig. 2A and figs. S1 to S3) (27).
This increase in East Asian ancestry com- 
pared with that in the BA is also observed in 
ADMIXTURE analysis (Fig. 2B and fig. S5) 
(21) and is supported by outgroup f3-statistics
(fig. S9) (22). The prevalence of both East 
Asian- and Steppe-related ancestry is further 
supported by the comparisons using f4(WSH,
Shamanka_EN; Xinj_Pop, Mbuti) and f4(WSH,
BMAC/Indus Periphery; Xinj_Pop, Mbuti), with 
two separate groups among IA populations— 
one having more affinity for East Asians and the 
other with more affinity for Steppe-related pop- 
ulations (fig. S20) (27). 
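
The f4 comparisons above can be illustrated in the same spirit. In its basic form, f4(A, B; C, D) is the average over SNPs of (a - b)(c - d), and its significance is usually reported as a Z-score, the estimate divided by a jackknife standard error. The sketch below uses a plain delete-one-block jackknife over equal-count blocks of consecutive SNPs, which is a simplification of the weighted block jackknife over genomic blocks used by dedicated tools. The function name and toy frequencies are hypothetical, and this is not necessarily how the statistics in this study were computed.

```python
import math
from typing import List, Sequence, Tuple


def f4_with_jackknife(
    a: Sequence[float],
    b: Sequence[float],
    c: Sequence[float],
    d: Sequence[float],
    n_blocks: int,
) -> Tuple[float, float, float]:
    """Return (f4 estimate, jackknife standard error, Z-score)."""
    n = len(a)
    if not (n == len(b) == len(c) == len(d)):
        raise ValueError("all inputs need the same number of SNPs")
    if n_blocks < 2 or n_blocks > n:
        raise ValueError("need between 2 and len(a) blocks")

    products = [(ai - bi) * (ci - di) for ai, bi, ci, di in zip(a, b, c, d)]
    total = sum(products)
    estimate = total / n

    # Leave-one-block-out estimates over contiguous blocks of SNPs.
    leave_one_out: List[float] = []
    for i in range(n_blocks):
        start, stop = i * n // n_blocks, (i + 1) * n // n_blocks
        block_sum = sum(products[start:stop])
        leave_one_out.append((total - block_sum) / (n - (stop - start)))

    mean_loo = sum(leave_one_out) / n_blocks
    variance = (n_blocks - 1) / n_blocks * sum(
        (x - mean_loo) ** 2 for x in leave_one_out
    )
    std_err = math.sqrt(variance)
    z_score = estimate / std_err if std_err > 0 else float("nan")
    return estimate, std_err, z_score


# Hypothetical frequencies at four SNPs, one SNP per jackknife block.
pop_a = [0.9, 0.8, 0.7, 0.6]
pop_b = [0.1, 0.2, 0.3, 0.4]
pop_c = [0.8, 0.7, 0.9, 0.6]
pop_d = [0.2, 0.3, 0.1, 0.4]
print(f4_with_jackknife(pop_a, pop_b, pop_c, pop_d, n_blocks=4))
```

A positive value indicates that A shares more alleles with C (and B with D) than the reverse pairing; swapping A and B flips the sign.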

Among the ancient East Asian populations, 
most of the Xinjiang BA and IA populations 
show a higher affinity in outgroup-f3 results
with ancient Northeast Asian Neolithic pop- 
ulations from Siberia, such as Shamanka_EN, 
Lokomotiv_N, and DevilsCave_N, and eastern 
Steppe IA populations, such as the Xiongnu, 
than with the northern (Boshan) or southern 
(Man_Bac) East Asian populations (figs. S9 and 
S25) (21). Among the present-day East Asian
populations, ancient Xinjiang populations in 
general show the highest affinity for Uyghur-, 
Austronesian (Ami)-, Hmong-Mien (She)-, and 
Sino-Tibetan (Han)-speaking populations 
rather than for Austroasiatic (Cambodian)- 
and Tai-Kadai (Dai)-speaking people (figs. S10 
and S25) (27). The uniparental markers with 
mitochondrial haplogroups C, D, M, and N in 
68 individuals from the north, south, and west, 
and the presence of Y-chromosomal haplogroup 
O3a2c in four individuals from the south and
east also reflect the high degree of East Asian 
ancestry in Xinjiang (35) (table S1 and fig. S26) 
(21). QpAdm models revealed the primary 
source of East Asian ancestry found in BA 
Xinjiang to be similar to the Northeast Asian 
ancestry common among Neolithic Mongolia 
(Mongolia_N), Lake Baikal (Shamanka_EN), 
and Yellow River (YR_MN) populations (tables S3 
and S4) (16). By contrast, the East Asian components
of IA populations are more diverse.
Although models of some populations support 
ancestry from either Xiongnu or Shamanka_ 
EBA (e.g., Xinj_IA14_aSte), other IA populations
work with only Xiongnu (and Han-related populations)
(e.g., Xinj_IA3_aEA) or Shamanka_EBA
(e.g., Xinj_IA4_aEA and Xinj_IA7_aEA) (tables
S9 and S10 and fig. S28) (21). The appearance of 
Xiongnu-related ancestry in IA Xinjiang aligns 
with the westward expansion of the Xiongnu 
in ~2200 B.P., after the defeat of the Yuezhi in 
Gansu and the Tarim Basin (1). A general increase
in broad East Asian ancestry beginning
in the IA, in addition to the continued pres- 
ence of the Northeast Asian ancestry of the BA, 
is observed in outgroup-f3 statistics (fig. S25) 
(21). Although this may be the result of Han- 
related ancestry components previously docu- 
mented in the Xiongnu (17), our results were 
unable to trace these East Asian streams to 
specific populations and cannot rule out the 
possible movement of additional East Asian 
groups into the region beginning in this period. 


Genetic continuity of IA mixed ancestry in HE populations


Our analysis of IA Xinjiang populations re- 
vealed intense mobility and genetic exchange 
across the region, with growing contributions 
of ancestry from East and South Asia. Very few 
genomic data exist from the post-IA popula- 
tions in the HE, but this information is crucial 
to trace the contributions that post-IA demographic
changes had on historical populations
and the evolution of modern Xinjiang genomic 
diversity. Political control of the region passed 
between the Xiongnu and the Han Dynasty 
until the third century CE, and in the sixth
century CE, the Turkic Khaganate established 
control over Xinjiang, later to be driven out 
by the Tang Dynasty. In the ensuing centu- 
ries, Xinjiang fell under the various influen- 
ces of Tibetans, Uyghurs, and Mongols (5). To 
document the effects that these events may 
have had on Xinjiang populations, we se- 
quenced and analyzed 27 HE (Xinj_HE) in- 
dividuals from 15 sites (table S1) (27). These 
Xinj_HE individuals clustered with the Xinj_IA 
individuals and similarly overlapped with 
Steppe nomadic groups, such as Tianshan 
Sakas, Huns, and Central Sakas, on a PCA 
(Fig. 2A and fig. S6) (27). 

Similar to the IA individuals, the Xinj_HE 
populations shared the most alleles with Steppe_ 
MLBA populations and present-day Europeans 
in outgroup-/3 analysis, with outlier individuals 
showing more affinity for either East Asian or 
Central Asian populations (figs. S9, S10, S23, and 
S24) (21). Comparisons of South and Central
Asian ancestry using f4(Gonur1_BA, Gonur2_BA;
Xinj_HE, Mbuti) showed two individuals [AYSG_HE_oEA
and BYH_HE_oEA (Z > -2.3, where
Z is the number of standard deviations from
zero)] with a greater affinity to Indus periphery
(Gonur2_BA) than to BMAC (Gonur1_BA) ancestry
(table S14) (27). Similar to the ancestry
present in the IA, we observed three major 
ancestry sources of Steppe_EMBA or Steppe_ 
MLBA (17 to 73%), BMAC (10 to 33%), and 
East Asian (9 to 57%), with several excep- 
tions (Fig. 3A and tables S11 and S12) (21). 
We grouped the 27 individuals of the HE into 
16 subgroups, analyzing HE individuals and 
populations showing unique ancestries sepa- 
rately. Two HE individuals were found to 
share a considerable amount of ancestry derived
from the SPGT population (JMCY_HE2_oIran
and BYH_HE_oEA from Yili and the
northeastern part of Xinjiang, respectively),
which may indicate a continued inflow of
South and Central Asian ancestry into Xinjiang 
(Fig. 3A and table S11) (27). In addition to the 
Steppe MLBA influence, we found several 
qpAdm working models that included Steppe_ 
EMBA or Xinj_BA1_TMBA1 ancestry as one of 
their components, two of them also containing 
the R1b1 haplogroup (tables S11 and S12) (27).
We also identified one individual from the HE 
Yili region (TBLS_HE1) with a working qpAdm 
model having both Afanasievo or Xinj_BA1_TMBA1
(18 to 39%) and Xinj_LBA1 (Andronovo)
or Sintashta_LBA (29 to 63%) ancestry sources
but having high affinity to Andronovo, because 
the qpAdm model works with Afanasievo as 
an outgroup (tables S11 and S12) (27). We also 
identified seven populations with Tarim_EMBA 
(16 to 34%) and Andronovo ancestry (22 to 63%) 
(tables S11 and S12) (27). Taken together, these 
findings document examples of first-wave Steppe 
EMBA ancestry persisting into the HE. 

To study the possible phenotypic effects of 
these ancestry shifts, we investigated predicted 
eye, hair, and skin pigmentation using the 
HIrisPlex-S system (36) on the higher-coverage 
individuals (n = 38 to 64; table S15) spanning 
all time periods. We observed the appearance 
of lighter hair color and skin tone in north and 
west Xinjiang beginning in the LBA and con- 
tinuing into the IA. Blue eye alleles also ap- 
peared in these regions at least during the IA, 
with one to two early IA samples having blue 
eyes in the Yili region (figs. S29 to S31 and table 
S15). Light eye and hair pigmentation has pre- 
viously been identified in people associated 
with the Steppe_MLBA Andronovo culture 
(37), with which several of these individuals 
are associated. Although we have phenotype 
results from only five HE samples from all 
regions, we note that all five have darker hair 
and eye pigmentation, corresponding to a pe- 
riod of increased ancestry from East, South, 
and Central Asia, although we interpret this 
with caution because of the extremely small 
sample size. 

Notably, ancestry sources identified in the 
IA—Steppe, BMAC, and East Asian—are also 
observed in present-day Xinjiang Uyghurs (38),
which suggests a genetic continuity from the
IA to present-day Xinjiang populations. In the 
PCA analysis with present-day Xinjiang pop- 
ulations, HE individuals overlapped with peo- 
ple from Central Asian Tajikistan, Kazakhstan, 
and Uzbekistan; Turkmens; and Uyghur people
from Xinjiang (fig. S3) (22), and individuals
having more Steppe-related affinity shared
greater drift with present-day Uyghurs. 

Thus, despite multiple influences by several 
nomadic groups in post-IA Xinjiang, we find 
a surprising degree of genetic continuity, in 
that the mixed ancestry present in the IA is also 
present in historical Xinjiang populations, and 
the increased influences of Central Asian and 
East Asian ancestries dating to the IA are still 
shared between historical and present-day 
Xinjiang populations. We caution, however, that
the sample size of our historical sites is not as 
robust as those of the BA and IA sites. A broader 
analysis covering a larger geographical area 
could better verify the trends observed with 
the current datasets. 


Discussion 


Beginning with ancestral sources traced to indig- 
enous ANE-derived and BA western and eastern 
Steppe populations, Xinjiang population struc- 
ture can be characterized by waves of incoming 
gene flow and admixture from surrounding 
populations adding to the extant ancestry. 
The BA Xinjiang region contained four major 
ancestries, which included Tarim_EMBA1 
(Xinj_BA1_TMBA1) (16), Afanasievo (Xinj_BA5_oAfan),
Northeast Asian (Xinj_BA7_oEA),
and BMAC (Xinj_BA6_aBMAC), with Tarim_EMBA1
possibly being indigenous to the region
given its presence among diverse BA individu- 
als (Fig. 3B). The Mongolian Chemurchek in- 
habitants near the Altai region were further 
linked to Xinjiang through the Chemurchek 
culture of BA northern Xinjiang, demonstrating 
the BA movement of people across the Altai 
region. Thus, we not only find support for 
both the Steppe and Bactrian oasis hypothe- 
ses (5, 13-15), but the identification of addi- 
tional ancestries suggests further complexity 
of the EBA populations in Xinjiang. Additional 
sampling of pre-BA and EBA populations will 
be necessary to further characterize the succes- 
sion of ancestries established in Xinjiang during 
this time period. 

Later, in the LBA, ancestry present in the 
Central Asian BMAC populations becomes 
more pronounced, which was likely to have 
entered Xinjiang through the IAMC route
along with the Steppe_MLBA populations,
such as the Andronovo, Sintashta, and Dali
(Botai-related ancestry) populations (Fig. 3B).
The entrance of Steppe MLBA (~3900 B.P.) 
into Xinjiang correlates with the arrival of the 
eastern Fedorovo subculture of the Andronovo 
(~3750 to 3500 B.P.) from the Tianshan Moun- 
tains (39). The IA is marked by an increase in 
movement and admixture of Steppe, Central
Asian, and East Asian people into the Xinjiang
region. The IA saw a continuation of
Steppe_MLBA ancestry with greater genetic 
affinity to Central Asian populations containing 
BMAC ancestry. We also observed ancestries 
derived from South Asian Hunter Gatherer 
(Onge) in the LBA and IA, which suggests 
the movement of populations either from 
Central Asia already carrying this ancestry 
or from the Indus periphery region through the 
Pamir mountain regions into southern Xinjiang 
(1). Concurrently, an IA influx of East Asian
ancestry from the eastern Steppe of present-day 
Mongolia is also observed, which may be tied to 
the westward expansion of the Pazyryk Xiongnu
into Xinjiang. These admixed ancestries related 
to Steppe, East Asian, and Central Asian people 
established in the IA have been maintained 
since that time and are still prevalent in both 
the HE and present-day Xinjiang, linking the 
past with present-day populations. Whereas 
aspects of this reconstructed population history 
find support in the archaeological record, sev- 
eral insights can be gained by comparing the 
newly generated genomic data with previous 
archaeological and historical evidence. 

First, although the diffusion of culture is 
not always accompanied by population move- 
ments (32), we observed an overall concordance 
between the two in Xinjiang populations. For 
example, the major genetic influences pres- 
ent at the earliest settlements of north and 
west Xinjiang populations can be related to 
the coexistence of people with different cultural 
backgrounds—e.g., Afanasievo, Chemurchek, 
and Okunevo (Fig. 3A and supplementary text). 
Also, the shift in population ancestries can be 
associated with proposed population move- 
ments. Specifically, the Afanasievo-related an- 
cestry in BA individuals is consistent with the 
concurrent appearance of the Yamnaya culture 
in the Altai-Sayan region, and the Steppe_MLBA
ancestry in LBA individuals can be attributed 
to the expansion of the Steppe_MLBA culture 
into Xinjiang (see the supplementary text, which 
provides archaeological backgrounds). In the IA, 
genetic affinities with nomadic groups such as 
the Saka reveal the widespread presence of these 
groups across the entire Steppe region. Overall, 
we detect Tarim_EMBA ancestry surviving into 
the IA and HE, which implies the cohabitation 
of populations descended from Steppe EMBA, 
BMAC, and local Tarim_EMBA people in 
Xinjiang (Fig. 3B). These findings also give 
genomic evidence for the broad demographic 
processes underlying the spread of numerous 
languages in Xinjiang that have survived into 
historical times, such as the introduction of 
Tocharian languages by populations associated 
with the Afanasievo culture. An increasing 
mobility and movement of the Sakas in the IA 
and the establishments of the Saka states 
lasting into the HE aided in the expansion 
of Indo-Iranian languages, such as Khotanese,
in Xinjiang as well.

Despite the widespread population move- 
ments documented here, the degree of genetic 
continuity that has been maintained in Xinjiang 
over the past 5000 years is noteworthy. Although 
genetic continuity has been observed in isolated 
environments or regions with relatively high 
cultural homogeneity, such as Northeast Asia 
(40), dynamic interactions between populations 
with diverse ancestries and cultures are more 
likely to result in major population shifts and 
turnover, such as in the Oceania archipelago 
and Europe (28, 41). However, this has not been 
the case for Xinjiang populations, where at least 
two different instances of genetic continuity are 
observed. The first is the genetic continuity 
(Steppe ancestry) from BA individuals to LBA 
and IA individuals, which represents a case in 
which a core Steppe ancestry has been main- 
tained despite the addition of an extensive 
influx of diverse ancestries. The second case 
is the stability of Xinjiang population diversity 
from the HE to the present day, despite the 
turmoil of successive external ruling powers 
over the past 2000 years. A major reason may 
be that this mixed ancestry was prevalent not 
just in Xinjiang but throughout Central Asia, 
so dynamic population movements would not 
result in major genetic shifts. These findings 
indicate that genetic and archaeological evi- 
dence can provide distinct yet complementary 
insights into population history. This, in turn, 
further emphasizes the importance of a mul- 
tidisciplinary approach to uncover the complex 
histories of regions like Xinjiang, where persist- 
ent interactions between various populations 
and cultures occurred.

A keystone gene underlies the persistence
of an experimental food web


Matthew A. Barbour, Daniel J. Kliebenstein, Jordi Bascompte


Genes encode information that determines an organism’s fitness. Yet we know little about whether genes 
of one species influence the persistence of interacting species in an ecological community. Here, we 
experimentally tested the effect of three plant defense genes on the persistence of an insect food web 
and found that a single allele at a single gene promoted coexistence by increasing plant growth rate, 
which in turn increased the intrinsic growth rates of species across multiple trophic levels. Our discovery of 
a “keystone gene” illustrates the need to bridge between biological scales, from genes to ecosystems, 
to understand community persistence. 


Genes and their alleles underlie population
persistence (1), yet their impact on ecological
communities remains unclear. Variation among
individual genotypes can scale up to influence
the diversity and composition of associated
communities (2-6); however, the causal genes
and specific alleles underlying these indirect
effects remain elusive. Tracing the ecological
impact of specific genetic variants has the
potential to transform how we act to conserve
genetic and species diversity in a changing
world (7).

To study the genetic basis of community- 
level effects, we used an experimental food web 
consisting of a plant (Arabidopsis thaliana), 
two species of aphids (Brevicoryne brassicae 
and Lipaphis erysimi), and a parasitoid wasp
(Diaeretiella rapae) (8) (Fig. 1A). These species
form a naturally occurring food web (9, 10),
and their interactions are partly mediated by
a group of specialized metabolites called ali- 
phatic glucosinolates (9). Extensive knowledge 
of genotype-to-phenotype causality in Arabi- 
dopsis aliphatic glucosinolates facilitates test- 
ing of how genetic change influences ecological 
interactions in a food web (Fig. 1, B and C).
Specifically, loss-of-function mutations at three
genes (MAM1, AOP2, and GSOH) control most of the
natural variation in the aliphatic glucosinolate
phenotype (11-15). These three genes function
in a sequential pathway, such that the presence 
of a functional allele at a given gene allows 
biosynthesis to proceed to the subsequent chem- 
ical compound, whereas a loss-of-function muta- 
tion (hereafter, null allele) diverts biosynthesis, 
resulting in the accumulation of the preceding
compound. Geographic variation at MAM1 is
correlated with the relative abundance of
aphids in our food web, with evidence that
L. erysimi selects for an increased frequency of
the null allele, whereas B. brassicae selects for
the functional allele (9). Aphids do not appear
to impose selection on AOP2 (9); however,
natural variation at AOP2 has pleiotropic
effects on plant growth, phenology, jasmonic
acid signaling, and circadian rhythms (16, 17),
which could influence food-web interactions.
Although its role in aphid herbivory is unclear,
GSOH contributes to adaptive variation in
aliphatic glucosinolates of Arabidopsis in its
native range (18).

To mimic natural variation at these three
genes, we used three transgenic lines (gsm1,
AOP2, and AOP2/gsoh) that were derived from
a common genetic background (Col-0 genotype)
in Arabidopsis (8). These transgenic lines
are analogous to the evolutionary processes
that determine natural variation in aliphatic
glucosinolates and also mimic the chemical
phenotype of diverse natural accessions that
co-occur throughout Europe (19-21). We created
experimental plant populations from different
combinations of Arabidopsis genotypes
(n = 11 combinations across 60 experimental
food webs), including monocultures (n = 4),
two-genotype mixtures (n = 6), and the
four-genotype mixture (n = 1). This experimental
design allowed us to both test a general effect
of genetic diversity and isolate the effect
of different alleles at specific genes (8). We
conducted the experiment under two different
temperature regimes (20°C and 23°C),
which reflect the warming this food web is
expected to experience over the next 25 to
50 years (8, 22). After adding the experimental
food web to each plant population, plants
were visually inspected every 7 days, and we
counted the number of each aphid species,
parasitized aphids (mummies), and adult
parasitoids for 17 weeks, allowing for multiple
generations of aphids (~16 generations) and
the parasitoid (~8 generations). Mummified
aphids contain a larval-stage parasitoid, so we
combined counts of mummified aphids and
adult parasitoids to estimate parasitoid
abundance (8).

Fig. 1. Study system. (A) The wasp Diaeretiella rapae (top) parasitizes the aphids Brevicoryne brassicae
(left) and Lipaphis erysimi (right), and these aphids compete for their shared resource Arabidopsis thaliana
(bottom). Solid and dashed arrows represent positive and negative effects, respectively. (B) We used four
Arabidopsis genotypes (gsm1, Col-0, AOP2/gsoh, and AOP2) that recreate natural variation in null (-) and
functional (+) alleles at three genes (MAM1, AOP2, and GSOH) that control the biosynthesis of aliphatic
glucosinolates. (C) The chemical phenotype (3MSO, 4MSO, But-3-enyl, or OH-But-3-enyl) of each Arabidopsis
genotype (color-coded boxes) depends on which genes have null and functional alleles.

Fig. 2. Allelic differences at AOP2 disproportionately affect food-web persistence. The dashed line
represents the effect of the average allele on the overall extinction rate after controlling for genetic
diversity. We arbitrarily chose the functional (+) allele of each gene as the baseline; therefore, points
represent the estimated effect of the null (-) allele at each gene (AOP2, MAM1, and GSOH), with thick (SEM)
and thin lines (95% CI) indicating uncertainty. Extinction rate ratio is plotted on a logarithmic scale to
make positive and negative values symmetric.
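
The count of 11 plant populations follows directly from the four genotypes. The short sketch below enumerates them; it is only an illustration of the design described above, and the function name `plant_populations` is a hypothetical helper, not part of the authors' analysis code.

```python
from itertools import combinations

# Genotype labels as given in the text and in Fig. 1B.
GENOTYPES = ["gsm1", "Col-0", "AOP2/gsoh", "AOP2"]


def plant_populations(genotypes):
    """Enumerate monocultures, two-genotype mixtures, and the full mixture."""
    monocultures = [(g,) for g in genotypes]
    pairs = list(combinations(genotypes, 2))
    full_mixture = [tuple(genotypes)]
    return monocultures + pairs + full_mixture


populations = plant_populations(GENOTYPES)
for population in populations:
    print(" + ".join(population))
print(len(populations), "combinations")
```

Four monocultures, six pairwise mixtures, and one four-genotype mixture give the 11 combinations reported in the text.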

We first tested for genetic effects on food-web
persistence, defined as the overall rate of
species extinctions, using a Cox
proportional-hazards survival analysis (8)
(figs. S1 to S3 and table S1). We found that,
on average, adding another
genotype to the plant population reduced the 
extinction rate by 19% {extinction rate ratio [95% 
confidence interval (CI)] = 0.81 [0.66-0.99], 
Wald Z = -2.11, P = 0.035}. We also observed 
that specific genes varied in their contribution 
to food-web persistence (Fig. 2). We found that
allelic differences at AOP2 had a disproportionate
effect on extinction rate, whereas different
alleles at MAM1 and GSOH had no clear
effects (Fig. 2 and table S1). In particular,
replacing the average genotype with one that
had a null AOP2- allele reduced the extinction
rate by 16% (Fig. 2, Wald Z = -2.32, P = 0.021).
This reduction is larger (29%) if we make the 
comparison relative to the functional AOP2+ 
allele, which better characterizes the ecolog- 
ical impact of this loss-of-function mutation. 
Analogous to a keystone species that deter- 
mines species diversity in a food web (23), our 
results indicate that AOP2 functions as a key- 
stone gene in this food web through its dis- 
proportionate contribution to maintaining 
species diversity (24). Indeed, we no longer ob- 
served an effect of genetic diversity if we first 
account for variation explained by genotypes 
with a null AOP2- allele (extinction rate ratio 
[95% CI] = 0.96 [0.74-1.26], Wald Z = -0.27, 
P = 0.789). AOP2’s keystone role was also ro- 
bust to experimental warming (extinction rate 
ratio [95% CI] = 0.98 [0.73-1.30]; Wald Z = 
-0.17, P = 0.865). 
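
The percentages quoted above can be related to the rate ratios with a little arithmetic. The sketch below is a worked illustration, not the authors' analysis: the ratio of 0.84 is inferred here from the reported 16% reduction, and the step from 16% to 29% assumes a symmetric two-allele coding in which the null and functional alleles sit at equal and opposite distances from the average on the log scale.

```python
import math


def percent_change(rate_ratio):
    """Convert an extinction rate ratio into a percent change in rate."""
    return (rate_ratio - 1.0) * 100.0


def null_vs_functional(ratio_null_vs_average):
    """Ratio of null vs. functional allele under the symmetric-coding assumption.

    If the null allele is +b and the functional allele is -b relative to the
    average on the log scale, the null-vs-functional contrast is 2b.
    """
    return math.exp(2.0 * math.log(ratio_null_vs_average))


diversity_ratio = 0.81   # reported ratio for adding one genotype
null_vs_average = 0.84   # assumed: implied by the reported 16% reduction

print(round(percent_change(diversity_ratio)))                      # -19
print(round(percent_change(null_vs_functional(null_vs_average))))  # -29
```

Under these assumptions, squaring the ratio (0.84 × 0.84 ≈ 0.71) reproduces the larger 29% reduction relative to the functional allele.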

To understand how variation at AOP2 de- 
termined food-web persistence, we analyzed 
its role in determining transitions in food-web 
structure using a multistate Markov model (8) 
(fig. S4 and table S2). Figure 3 provides a map 
of the food-web transitions that we observed 
over the course of this experiment. Overall, 
we found that the initial food web (yellow in
Fig. 3) underwent a transition where either
all insects went extinct (10 out of 60 cages;
yellow to purple in Fig. 3) or, more commonly,
it transitioned into a three-species food chain
with the aphid L. erysimi (50 out of 60, yellow
to green in Fig. 3). This remaining food
chain either persisted for the duration of the
experiment (8 out of 50), or transitioned into
either an aphid-only food web (25 out of 50,
green to blue in Fig. 3) or Arabidopsis-only
state (17 out of 50, green to purple in Fig. 3).
The keystone role of AOP2 acted primarily to
prevent the transition from the three-species
food chain to the Arabidopsis-only state (green
to purple), with a null AOP2- allele reducing
the weekly risk of this food-chain transition
by 67% (Fig. 3, left; hazard ratio [95% CI] =
0.33 [0.15-0.73]).

Fig. 3. AOP2- allele maintains species diversity by preventing the transition of the three-species food chain
to the Arabidopsis-only state. Each color corresponds to a different food-web structure, and arrows between
two colors indicate a food-web transition. Arrow thickness is proportional to the percentage change in weekly risk
of a transition (numbers) when either a genotype with a null AOP2- allele (left) or functional AOP2+ allele (right) is
added to the plant population. Solid and dashed arrows indicate positive and negative changes, respectively; black
and gray arrows denote clear (P < 0.05) and unclear effects, respectively. Rare transitions were not tested (n.t.).

Fig. 4. AOP2- allele fosters coexistence by increasing the aphid and parasitoid’s intrinsic growth rates.
The gray area represents the range of intrinsic growth rates (normalized to length 1) where the aphid and
parasitoid coexist but the parasitoid goes extinct beyond the lower boundary. Arrows represent the composite
vector of aphid and parasitoid growth rates in two-genotype mixtures with either AOP2- alleles (Col-0 and
gsm1, solid arrow) or AOP2+ alleles (AOP2 and AOP2/gsoh, dashed arrow) (8).
[Plot: y axis, parasitoid growth rate (normalized), from 0.00 to -0.75.]
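
The transition counts reported above can be tabulated to check that they are internally consistent and to express the hazard ratio as a percent change in weekly risk. This is a minimal bookkeeping sketch using only numbers quoted in the text; the dictionary and function names are illustrative choices, not the authors' code.

```python
# Outcomes of the initial four-species food web (60 cages).
initial_outcomes = {"all insects extinct": 10, "three-species food chain": 50}

# Fate of the 50 three-species food chains.
chain_outcomes = {"persisted": 8, "aphid-only": 25, "Arabidopsis-only": 17}


def proportions(counts):
    """Return each outcome as a fraction of the total."""
    total = sum(counts.values())
    return {outcome: n / total for outcome, n in counts.items()}


def risk_change_percent(hazard_ratio):
    """Percent change in weekly transition risk implied by a hazard ratio."""
    return (hazard_ratio - 1.0) * 100.0


print(proportions(initial_outcomes))
print(proportions(chain_outcomes))

# Reported hazard ratio and 95% CI for the chain-to-Arabidopsis-only transition.
estimate, lower, upper = 0.33, 0.15, 0.73
print([round(risk_change_percent(x)) for x in (estimate, lower, upper)])
```

The point estimate corresponds to the 67% reduction in the text, and the confidence limits correspond to reductions of roughly 85% and 27%.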

To decipher the specific mechanism by which 
allelic differences at AOP2 altered the risk of 
the food chain collapsing, we applied theory 
on the structural stability of food webs (8). 
Following recent work, we say that a food web 
is structurally stable for as long as it remains 
both dynamically stable and fully feasible (25, 26). 
The geometry of intra- and interspecific inter- 
actions defines the boundary of the food web’s 
feasibility domain, whereas species’ popula- 
tion growth rates at low density (hereafter, 
intrinsic growth rates) determine the proxim- 
ity of the food web to these boundaries (26). At 
the boundary of this feasibility domain, at least 
one species will go extinct, thus transitioning to 
a simpler community. Therefore, the distance
(or, to be precise, the normalized angle) from
an extinction boundary is a measure of the
structural stability of the food web. We applied
a Bayesian multivariate autoregressive [MAR(1)] 
model to our time series of species’ abundances 
to quantify AOP2’s effect on interactions and
species’ intrinsic growth rates, and thus the
risk of the food chain collapsing (8) (figs. S5 
to S9 and tables S3 and S4). 
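
The idea of feasibility and of an angle to the extinction boundary can be illustrated with a toy two-species model. The sketch below assumes a generalized Lotka-Volterra form, dN_i/dt = N_i (r_i + sum_j A_ij N_j), which is a simplification and not the Bayesian MAR(1) model fitted by the authors. The interaction matrix `A` and both growth-rate vectors are hypothetical values chosen only for illustration.

```python
import math


def equilibrium(A, r):
    """Solve A @ N = -r for a 2x2 interaction matrix using Cramer's rule."""
    (a, b), (c, d) = A
    b1, b2 = -r[0], -r[1]
    det = a * d - b * c
    return ((b1 * d - b * b2) / det, (a * b2 - b1 * c) / det)


def is_feasible(A, r):
    """A community is feasible if every equilibrium abundance is positive."""
    return all(n > 0 for n in equilibrium(A, r))


def angle_deg(u, v):
    """Angle between two vectors, in degrees."""
    dot = u[0] * v[0] + u[1] * v[1]
    cos = dot / (math.hypot(*u) * math.hypot(*v))
    return math.degrees(math.acos(max(-1.0, min(1.0, cos))))


# Hypothetical aphid (row 1) and parasitoid (row 2) interaction coefficients.
A = [[-1.0, -0.5],
     [0.5, -0.2]]

# Growth vectors on the boundary satisfy r = -A @ (N1, 0): parasitoid at zero.
parasitoid_boundary = (-A[0][0], -A[1][0])

r_low = (1.0, -0.6)    # hypothetical vector below the boundary
r_high = (1.0, -0.3)   # hypothetical vector inside the feasibility domain

print(is_feasible(A, r_low), is_feasible(A, r_high))
print(angle_deg(r_high, parasitoid_boundary))
```

In this toy setting, raising the parasitoid's intrinsic growth rate rotates the growth vector away from the boundary and into the feasible region, which is the geometric picture the text describes for the AOP2- allele.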

We found that the null AOP2- allele reduced 
the risk of extinctions in the food chain in a 
specific way (Fig. 4). This reduced risk was not 
because this allele altered the location of the 
extinction boundary (lower border of gray area 
in Fig. 4). Rather, it was because the AOP2- allele 
moved the vector of species’ intrinsic growth 
rates (solid arrow) upward and into the region 
where the parasitoid and aphid coexist (Fig. 4
and fig. S8, Δ normalized angle [95% CI] = 13.0°
[3.9-23.5]), a result that holds under
nonequilibrium conditions (fig. S9). This
buffering effect was primarily due to a
concordant increase in the aphid’s (Δr of
L. erysimi [95% CI] = 0.54 [-0.04-1.13]) and
parasitoid’s (Δr of D. rapae [95% CI] =
0.66 [0.09-1.25]) intrinsic growth rates. These
increases in intrinsic growth rates were likely 
due to a bottom-up effect of the faster relative 
growth rate of Arabidopsis genotypes with a 
null AOP2- allele (8) (linear mixed-effect model 
analysis of variance: F579 = 15.75, P = 0.008; 
fig. S10 and table S5), as has been observed in 
other plant-herbivore-parasitoid food chains 
(27-29). 

Our results show that variation at a single 
gene can control the persistence of directly and 
indirectly interacting species in a food web. 
Given that food chains of plants, insect herbi- 
vores, and their parasitoids comprise ~40% of 
all described Eukaryotes (30), this mechanism 
may have far-reaching consequences for the 
persistence and functioning of terrestrial eco- 
systems. These consequences may be context 
dependent though, as theory predicts that 
increased intrinsic growth rates may actually 
destabilize food chains in more productive
environments (31). Field-based experiments are
ultimately needed to test the relevance of our 
results. Still, our work highlights the poten- 
tial promise of integrating tools from genetics 
and ecological networks to predict the conse- 
quences of genetic change for the persistence 
of biodiversity across scales. In particular, our 
study suggests that the current loss of genetic 
diversity (32) may have cascading effects that 
cause abrupt and catastrophic shifts in food- 
web structure and function.

SOLAR CELLS


Damp heat-stable perovskite solar cells with 
tailored-dimensionality 2D/3D heterojunctions 


Randi Azmi, Esma Ugur, Akmaral Seitkhan, Faisal Aljamaan, Anand S. Subbiah, Jiang Liu,
George T. Harrison, Mohamad I. Nugraha, Mathan K. Eswaran, Maxime Babics, Yuan Chen,
Fuzong Xu, Thomas G. Allen, Atteq ur Rehman, Chien-Lung Wang, Thomas D. Anthopoulos,
Udo Schwingenschlögl, Michele De Bastiani, Erkan Aydin, Stefaan De Wolf


If perovskite solar cells (PSCs) with high power conversion efficiencies (PCEs) are to be commercialized, 
they must achieve long-term stability, which is usually assessed with accelerated degradation tests. 
One of the persistent obstacles for PSCs has been successfully passing the damp-heat test (85°C and 
85% relative humidity), which is the standard for verifying the stability of commercial photovoltaic 
(PV) modules. We fabricated damp heat-stable PSCs by tailoring the dimensional fragments of 
two-dimensional perovskite layers formed at room temperature with oleylammonium iodide molecules; these 
layers passivate the perovskite surface at the electron-selective contact. The resulting inverted PSCs 
deliver a 24.3% PCE and retain >95% of their initial value after >1000 hours at damp-heat test conditions, 
thereby meeting one of the critical industrial stability standards for PV modules. 


Commercialization of any photovoltaic (PV)
technology requires a guaranteed product
lifetime of at least 25 to 30 years, as is
common for conventional crystalline silicon
(c-Si) PV modules. Lifetime predictions
of PV technologies are usually accomplished
through standardized accelerated degradation
tests. After the demonstration of excellent power
conversion efficiencies (PCEs) of perovskite
solar cells (PSCs), the main challenge toward
market entry of PSCs is successfully passing
standard industrial lifetime assessment tests
of the International Electrotechnical Commission
(IEC 61215:2016), in particular damp-heat
testing at 85°C and 85% relative humidity
(1, 2). A PSC with a stabilized PCE comparable
to that of a commercial c-Si solar cell
(PCE ~20%) would need to pass a damp-heat
test for >1000 hours with <5% relative loss
in PCE (3, 4).

Degradation of encapsulated PSCs is usually 
caused by leakage in the packaging (allowing 
atmospheric agents to interact with the perov- 
skite) and device-related material instabilities. 
To address this, we developed leakage-free 
device packaging that seals the PSC within 
two glass sheets, using a vacuum-laminated
encapsulant and edge sealing via butyl rubber.
Despite this, damp-heat testing of our en- 
capsulated control devices resulted in fast 
degradation (see below), implying an intrinsic 
thermal instability of the perovskite absorber 
layer itself. 

The instability of three-dimensional (3D) 
perovskite films, when used as the photoactive 
absorber layer in PSCs, is mainly attributed to 
high defect densities as well as ion migration 
at grain boundaries and interfaces, which is 
exacerbated at higher operational tempera- 
tures (5-9). Several approaches have been re- 
ported to passivate these defects (7, 10-13). 
Specifically, growing 2D perovskite layers on 
the top surface of 3D perovskites creates a 2D/ 
3D perovskite heterojunction that can effec- 
tively passivate surface defects and suppress 
ion migration (3, 6, 14-18). 

At the device level, integrating such 2D
perovskite passivation layers in PSCs can
enhance their PCE and lifetime (3, 4, 14-17). So
far, this strategy has been most successful for
“regular structured” PSCs in which phase-pure
2D perovskites (n = 1, where n represents the
dimensionality of the 2D perovskite by counting
the number of its octahedral inorganic
sheets) were inserted between the 3D perovskite
surface and the (opaque) hole-selective
top-contact stack (3, 14-17). For “inverted”
devices, this top-contact passivation approach
(now at the electron-selective side) has
consistently failed in PCE and lifetime; this
represents a persistent challenge in the perovskite
community (19, 20), as inverted PSCs are
arguably easier to fabricate and scale up (J7).

We found that tailoring the dimensionality 
(n) of the 2D perovskite fragments at the 
electron-selective interface of inverted PSCs is 
essential to enable efficient top-contact pas- 
sivation through 2D perovskite passivation 
layers. This interface has frequently been
ignored because it is assumed that the
conventional electron-selective layer, C60 (or its
derivatives), provides sufficient passivation of
3D perovskites (27); instead, attention has
predominantly focused on the hole-selective
interface of inverted PSCs, situated at the
(transparent) bottom contact of the device
(22-25). However, recent reports have revealed
that C60 is only weakly bonded to perovskite
layers, which induces a high energetic
disorder between perovskite and C60 layers that
limits device performance at elevated operating
temperatures (5, 6, 26). Moreover, a thin
layer of C60 is insufficient to effectively protect
the 3D perovskite film underneath from moisture
or oxygen ingress. Implementing 2D perovskite
passivation layers is a promising approach
to solve all of the issues mentioned above.

We post-treated the surface defects of the 
3D perovskites by applying oleylammonium 
iodide (OLAI) molecules to form Ruddlesden- 
Popper-phase 2D perovskite layers, which re- 
sulted in higher PCEs and prolonged stabilities 
of inverted PSCs. We tailored the dimension- 
ality, n, of the 2D-perovskite fragments (which 
also dictates their optical and electronic prop- 
erties) by tuning the annealing conditions 
from lower to higher temperatures, in that 
higher-n layers have a lower formation energy 
(27). Indeed, all 2D perovskite passivation 
layers prepared through thermal annealing 
(2D-TA) showed a dominant emission peak 
at ~510 nm (as evidenced in fig. S1), which
belongs to n = 1, in accordance with previous 
studies (4, 6, 19, 28). However, the formation 
of higher-dimensionality 2D perovskite layers 
(n = 2) became more pronounced when the 
post-treatment was performed at room tem- 
perature (2D-RT) with the OLAI molecule 
(Fig. 1A). 

We investigated the formation of 2D perovskite
passivation films on 3D perovskites with
grazing-incidence wide-angle x-ray scattering
(GIWAXS; Fig. 1, B and C). The 2D perovskite
passivation films exhibited diffraction qz peaks
at ~0.2 to ~0.5 Å⁻¹, corresponding to the (001)
and (002) planes of 2D perovskite crystals (28).
As expected, the 2D-TA films were dominated
by n = 1 layers (with a prominent peak at qz =
0.35 Å⁻¹). In contrast, 2D-RT films exhibited
the diffraction peaks of n = 1 and n = 2, with a
more substantial n = 2 peak at lower qz (16).
The strong intensity in the z-direction for 2D
perovskite films was indicative of a highly
oriented lateral direction of the top 3D
perovskite layers.

Cross-sectional high-resolution scanning
transmission electron microscopy (HR-STEM)
images also showed n = 1 and n = 2 layers in
2D-RT samples (Fig. 1D and fig. S2A) but only
n = 1 in 2D-TA (fig. S2C), which is consistent
with the GIWAXS results. To differentiate
between n = 1 and n = 2 layers, we performed
elemental mapping and position profiling of
the 2D perovskite layers; Fig. 1E and fig.
S2B show two dimensionalities of 2D layers in
2D-RT samples as the reduction of the density
of C, Pb, and I elements, which correspond to
n = 1 and n = 2 layers. In comparison, 2D-TA
samples showed only n = 1 (fig. S2D). Further,
profiling the position of 2D perovskites from
TEM images confirmed the dimensionality of
n = 2 and n = 1 by analyzing average distances
between the two closest octahedral inorganic
sheets. As a result, higher dimensionality
(n = 2) had a wider distance (~1.5 nm) relative
to n = 1 (~1.2 nm) (fig. S2, E to H).

Fig. 1. Structure of 2D perovskite dimensionality control by tuning the
annealing conditions. (A) Schematic illustration of 2D perovskite passivation
with different n layers under thermal annealing at 100°C (TA) and room-
temperature process (RT). (B) Integrated intensity of GIWAXS data along qz.
(C) GIWAXS maps of each film. (D and E) Cross-sectional HR-STEM image
(D) and high-angle annular dark-field (HAADF)/energy-dispersive x-ray
spectroscopy elemental map (E) of the cross-sectional STEM image of the
2D-RT samples.

Scanning electron microscopy (SEM) top-view
images revealed that the surface morphology
of the perovskite films after 2D perovskite
passivation did not change substantially (fig.
S3, A and B). Further, the 2D perovskite
passivation films exhibited stronger
photoluminescence (PL) emission with a longer
PL decay lifetime than control 3D perovskite
films because of the suppression of nonradiative
recombination associated with trap states at
the surface (figs. S4 and S5). Interestingly,
the 2D perovskite (n = 2) capping layer formed
uniformly on top of 3D perovskite surfaces
for 2D-RT, as shown in the PL images in
Fig. 2A. In addition, PL spectra in 2D-TA
samples showed a dominant emission peak
corresponding to n = 1, whereas a PL emission
associated with a higher dimensionality of
n = 2 is more pronounced in 2D-RT samples
(see Fig. 2B), in accordance with GIWAXS
and TEM results.

The energy-level diagrams of
[2-(9H-carbazol-9-yl)ethyl]phosphonic acid
(2PACz) anchored on indium tin oxide (ITO)
[which we used as the hole-selective contact
(22)], perovskites, and C60 are shown in
Fig. 2C. With the OLAI post-treatment, the
secondary electron cutoff (Ecutoff) shifted to
a higher binding energy, indicating that the
ion exchange-induced 3D-to-2D perovskite phase
transition could lower the Fermi level (EF) of
post-treated perovskite films (fig. S6). Notably,
the energetic gap between EF and the valence
band maximum (VBM) of the 2D-RT sample was
wider, indicating the enhanced n-type character
of post-treated 3D perovskite films, which we
attribute to a successful 2D perovskite
passivation strategy (19). The conduction band
minimum (CBM) of 2D-RT films was also closer to
the CBM of C60 at the n-type contact, which
resulted in more efficient charge transfer at the
2D/3D perovskite interface and the C60
electron-selective layer. In contrast, the CBM
of 2D-TA films was much higher than the CBM
of C60 with less n-type character and resulted
in less efficient charge transfer of this 2D/3D
perovskite interface at the electron-selective
contact. In addition, the 2D perovskite capping
layers also enhanced the resilience against
moisture of 3D perovskite films, as shown by
contact angle measurement in fig. S7.

Fig. 2. Optical characterization and energetic alignments of perovskite
films with and without 2D perovskite passivation. (A) PL images of control
3D, 2D-TA, and 2D-RT films at wavelength ~570 nm, which corresponds to
n = 2 layers (images extracted using PHySpecV2). (B) Normalized PL spectra
of each film from low to high wavelengths. (C) Energy level scheme for the
control and OLAI-treated films extracted from UPS data. The VBM was obtained
as hν - (Ecutoff - EVB,min). The position of the CBM with respect to the VBM
was defined by the optical Eg (1.55 eV).

Next, we fabricated inverted PSCs with a
structure of glass/ITO/2PACz/3D perovskite/
2D perovskite/C60/bathocuproine (BCP)/Ag
(Fig. 3A); Fig. 3B shows the false-colored
cross-sectional SEM view of these devices. As
shown in the current density-voltage (J-V)
characteristics of the devices in Fig. 3C, the
2D-RT devices demonstrated substantially
improved PCEs, with a maximum PCE of 24.3% and
stabilized PCE of ~24% (open-circuit voltage,
Voc, of ~1.20 V and fill factor, FF, of ~82%; fig.
S8, A and C). These results represent an
absolute ~2% PCE gain upon 2D-RT passivation
and can be compared with PCEs for other
inverted PSCs (see fig. S9A).

Also, 2D-RT passivation enables minimization
of the device energy loss (Eloss = Eg - qVoc,
where q is electron charge and Eg is
the optical bandgap) to as low as 0.34 eV, which
corresponds to ~96% of the thermodynamic limit
of the Voc (1.262 V) for Eg of 1.55 eV (fig. S8, D
and E). The reduced nonradiative loss in
2D-RT-based devices is comparable with
state-of-the-art GaAs solar cells (Voc of 1.127 V,
yielding ~98% of the thermodynamic limit of
the Voc) (29). The 2D-TA passivated devices
suffered from lower FF values (<79%; figs.
S8B and S10), indicative of an energy level
mismatch at the electron-selective contact, as
derived from ultraviolet photoelectron
spectroscopy (UPS) results (20). The narrow
statistical distribution of the PCE, Voc, FF, and
short-circuit current density (Jsc) values of the
devices (figs. S8B and S11) confirmed the high
reproducibility of our approach. We also
confirmed the effectiveness of our approaches by
showing less than 0.5% deviation for
person-to-person variations of seven different
researchers (fig. S12A).
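
The energy-loss figures quoted above can be checked with a short calculation. The sketch below uses only the relation Eloss = Eg - qVoc and the numbers given in the text; because qVoc expressed in eV is numerically equal to Voc in volts, no unit conversion is needed. The function names are illustrative.

```python
def voc_from_energy_loss(bandgap_eV, loss_eV):
    """Open-circuit voltage (V) implied by Eloss = Eg - q*Voc."""
    return bandgap_eV - loss_eV


def fraction_of_limit(voc_V, voc_limit_V):
    """Voc as a fraction of its thermodynamic limit."""
    return voc_V / voc_limit_V


bandgap = 1.55        # optical bandgap Eg in eV, from the text
energy_loss = 0.34    # minimum reported Eloss in eV
voc_limit = 1.262     # thermodynamic Voc limit in V for this bandgap

voc = voc_from_energy_loss(bandgap, energy_loss)
print(round(voc, 2))                                  # 1.21
print(round(100 * fraction_of_limit(voc, voc_limit)))  # 96
```

An energy loss of 0.34 eV therefore corresponds to a Voc of about 1.21 V, which is consistent with the ~96% figure and with the ~1.20 V reported for the champion devices.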

Further, our proposed passivation approach
was universal for various perovskite
compositions (various bandgaps) and deposition
techniques (such as one-step, two-step,
and blade-coating), as demonstrated by the
systematic absolute PCE enhancement of 1.5
to 2.0% of device performance in fig. S12B.
The reduced trap-assisted recombination of
2D/3D perovskite heterojunction devices was
also investigated with transient photovoltage
decay and light intensity-dependent
measurements under open-circuit conditions
(fig. S13). The 2D-passivated devices exhibited
a longer charge recombination lifetime and
lower ideality factor than control devices,
confirming the reduced trap-assisted
recombination at 3D/C60 interfaces by the 2D
perovskite passivation.

Fig. 3. Device performance and stability of 2D/3D perovskite heterojunction. (A) Device architecture of inverted PSCs. (B) Cross-sectional SEM image of inverted cells.
(C) J-V scan of champion PSCs. (D) Variation of PCEs during damp-heat test of encapsulated devices. The continuous lines are guides to the eye; error bars denote SD.
(E) Continuous MPPT for the encapsulated control and 2D-RT cells under AM 1.5 illumination in ambient air. Inset: Photograph of the encapsulated device.

Finally, we subjected our 2D perovskite 
passivation-treated PSCs to a set of rigorous 
stability tests. First, we evaluated the stability 
of our encapsulated devices when subjected 
to industry-relevant damp-heat tests (fig. S14). 
Here, our 2D-perovskite passivation
simultaneously served as ion migration-blocking
moisture/oxygen ingress barriers and as defect
passivation layers, particularly at elevated
operating temperatures (see fig. S15) (3, 14-17).
Indeed, the 2D-RT-based device retained >95% 
of the initial PCE (T95) after >1200 hours for
champion stability cells (Fig. 3D). Remarkably, 
after the damp-heat test, three devices showed 
an average PCE of 19.3 + 0.69%. 

Our results represent the successful
encapsulation of PSCs passing the industry-relevant
damp-heat test according to the IEC 61215:2016
protocols (1, 2). Also, our final PCE of >19%
after >1000 hours of damp-heat test represents
a very high retained PCE (fig. S9B and
table S1). There was no substantial change in
the structural and optical properties of the
2D perovskite passivation films (both 3D and
2D perovskites) after >500 hours of thermal
annealing at 85°C under dark condition,
confirming the robustness of our 2D perovskite
passivation approach (fig. S16).

Note that our encapsulated devices used 
for stability tests exhibited slightly lower ini- 
tial PCEs than the unencapsulated devices 
because of $J_{\mathrm{SC}}$ losses originating from the encapsulant
and glass sheets (fig. S17). We also
tested unencapsulated devices in our damp- 
heat chamber, applying thermal tests in am- 
bient air with relative humidity of >50% (figs. 
S18 and S19), representative of extreme outdoor
conditions. Our 2D capping layer introduced
a substantially enhanced resistance of
the devices against high moisture and thermal 
stress. Finally, we performed maximum power 
point tracking (MPPT) measurements for en- 
capsulated cells under simulated 1-sun illumi- 
nation in ambient air for >500 hours (Fig. 3E). 
Here, the 2D-RT-based devices retained up to
~95% of their initial PCE after an MPPT test
of >500 hours, whereas the control devices
retained their PCE of <90% for only ~100 hours.

GAMMA-RAY ASTRONOMY


Time-resolved hadronic particle acceleration 
in the recurrent nova RS Ophiuchi 


H.E.S.S. Collaboration* + 


Recurrent novae are repeating thermonuclear explosions in the outer layers of white dwarfs,
due to the accretion of fresh material from a binary companion. The shock generated when ejected
material slams into the companion star’s wind can accelerate particles. We report very-high-energy
[VHE; ≳100 giga-electron volts] gamma rays from the recurrent nova RS Ophiuchi, up to 1 month
after its 2021 outburst, observed using the High Energy Stereoscopic System (H.E.S.S.). The
temporal profile of VHE emission is similar to that of lower-energy giga-electron volt emission,
indicating a common origin, with a 2-day delay in peak flux. These observations constrain models of
time-dependent particle energization, favoring a hadronic emission scenario over the leptonic
alternative. Shocks in dense winds provide favorable environments for efficient acceleration of
cosmic rays to very high energies.


RS Ophiuchi (RS Oph) is a recurrent nova
system composed of a white dwarf and
a companion red giant star. Novae are a
source of high-energy particles and photons
(1, 2), with nonthermal gamma-ray
emission in the range of ~100 MeV to ~10 GeV 
(3). We adopt a distance of ~1.4 kpc from Earth 
to the RS Oph system (4); analysis of astro- 
metric data suggests larger distances of 2.3 kpc 
(5) or 2.7 kpc (6), but we regard these estimates 
as less reliable because of the binary orbital 
motion in RS Oph. The binary components 
have a separation of 1.48 astronomical units, 
close enough for the white dwarf to continu- 
ally accrete material from its companion star 
(7). At irregular intervals, enough material ac- 
cumulates on the surface of the white dwarf to 
trigger a thermonuclear explosion, driving a 
quasi-spherical shock (8) into the red giant’s 
wind (fig. S2). Eight outbursts were observed 
between 1898 and 2006, recurring at intervals 
of 9 to 26 years (9). 

On 8 August 2021, an outburst of RS Oph was 
identified in optical observations (10), reach- 
ing a peak naked-eye visual magnitude of 4.5, 
more than 1000 times as bright as the quies- 
cent visual magnitude of 12.5. We observed 
RS Oph with the High Energy Stereoscopic 
System (H.E.S.S.), an array of five atmospheric 
Cherenkov telescopes (8). Observations com- 
menced on 9 August 2021 and continued for 
five nights, until 13 August 2021. A higher op- 
tical background due to moonlight prevented 
good quality observations for the next 10 nights. 
During each of the first five nights, H.E.S.S. 
detected point-like gamma-ray emission from 
the direction of RS Oph, with a significance of
>6$\sigma$ on each night (table S1). The combined
data for those five nights are shown in Fig. 1.
Observations recommenced on 25 August 2021,
~17 days after the initial outburst. We find
evidence for a much weaker signal, ~3$\sigma$
above the background, in ~15 hours (after 
quality cuts) of data accumulated over the 
subsequent 14 days. 
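
The brightness factor quoted above follows from the standard astronomical magnitude scale, in which a difference of 5 magnitudes corresponds to a flux ratio of exactly 100. As a minimal sketch (the function name is illustrative and not part of the H.E.S.S. analysis), the ratio implied by the two visual magnitudes given in the text can be checked as follows:

```python
def brightness_ratio(m_faint: float, m_bright: float) -> float:
    """Flux ratio implied by a magnitude difference (Pogson relation)."""
    return 10 ** (0.4 * (m_faint - m_bright))


# Quiescent (12.5) versus peak (4.5) visual magnitude of RS Oph, as quoted in the text
ratio = brightness_ratio(12.5, 4.5)
print(f"Peak is about {ratio:.0f} times brighter than quiescence")
```

A difference of 8 magnitudes gives a factor of 10^3.2, which is consistent with the statement "more than 1000 times as bright".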

We performed a spectral analysis of the 
H.E.S.S. data for the first five observation nights 
separately. We also separated the data from 
the array of four 106-m$^2$ mirror area telescopes
(designated CT1 to CT4) and a fifth low-threshold
telescope (CT5) with a mirror area of 612 m$^2$
(8). We find that the very-high-energy (VHE) 
flux is variable, with a spectral index >3 through- 
out (table S2). 

Figure 2 shows the time evolution of the 
gamma-ray flux from RS Oph for photon en- 
ergies between 250 GeV and 2.5 TeV. The VHE 
gamma-ray flux rises smoothly from $T_0$, the
time of peak optical emission in the V band
(10), until a VHE peak on the third night of
observations. The VHE gamma-ray flux then 
decays by an order of magnitude over a 2-week 
period. We obtained 60-MeV to 500-GeV data 
taken by the Fermi-LAT (Large Area Telescope) 
instrument for the same time period as the 
H.E.S.S. observations, which are also shown in
Fig. 2. The Fermi-LAT flux varies in the range
of ~1 x 10° to 2 x 10°” erg cm$^{-2}$ s$^{-1}$, with a peak
on $T_0$ + 1 day. The VHE gamma-ray emission
peak is delayed by a further 2 days. 

We used a power-law model $t^{-\alpha}$, with exponent
$\alpha$ and time $t$, to fit the decay of the flux
after the peak. For the choice of $T_0$ = 1 day, the
best-fitting values are $\alpha_{\mathrm{HESS}} = 1.43 \pm 0.18$ for
H.E.S.S. and $\alpha_{\mathrm{LAT}} = 1.31 \pm 0.07$ for Fermi-LAT. We
find consistent results for the flux and tem- 
poral decay if the Fermi-LAT data are analyzed 
in bins of 24-hour duration (8); the higher 
signal-to-noise ratio of these larger bins enables 
a more detailed spectral analysis. 
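
To illustrate how a decay index of this kind can be estimated, the sketch below fits a power law by unweighted linear regression in log-log space. This is an assumption-laden simplification: the actual fit procedure is described in the methods (8) and may differ, the data points are synthetic and noise-free, and the function name and normalization are hypothetical.

```python
import math


def fit_power_law_decay(times, fluxes):
    """Least-squares fit of ln F = ln A - alpha * ln t; returns (A, alpha)."""
    xs = [math.log(t) for t in times]
    ys = [math.log(f) for f in fluxes]
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return math.exp(intercept), -slope


# Synthetic points generated with alpha = 1.43 (illustrative only, not H.E.S.S. data)
times_days = [3.0, 4.0, 5.0, 10.0, 20.0, 30.0]
fluxes = [2.0e-11 * t ** -1.43 for t in times_days]

amplitude, alpha = fit_power_law_decay(times_days, fluxes)
print(f"alpha = {alpha:.2f}")
```

A fit to real light-curve points would weight each point by its uncertainty and propagate those uncertainties into the error on the index; the unweighted version is kept here only for brevity.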

The combined H.E.S.S. and Fermi-LAT data 
allow us to measure wide-band gamma-ray
spectra over more than four orders of magnitude
in energy and to follow their temporal
evolution (Fig. 3). The RS Oph spectra are con- 
sistent with a log-parabola model. Comparison 
of spectra taken on different nights shows a 
general trend for the flux normalization to de- 
crease and the parabola to widen over time (8) 
(table S2). 
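
For readers unfamiliar with the log-parabola form, the following sketch shows one common parameterization, in which the photon index changes linearly with the logarithm of energy. The natural-logarithm convention, the reference energy, and all parameter names and values are assumptions made for illustration; the convention used in the analysis itself is given in the methods (8).

```python
import math


def log_parabola(energy, norm, alpha, beta, e_ref=1.0):
    """dN/dE = norm * (E / E_ref) ** -(alpha + beta * ln(E / E_ref))."""
    x = energy / e_ref
    return norm * x ** -(alpha + beta * math.log(x))


def local_index(energy, alpha, beta, e_ref=1.0):
    """Local photon index, -d ln(dN/dE) / d ln E = alpha + 2 * beta * ln(E / E_ref)."""
    return alpha + 2.0 * beta * math.log(energy / e_ref)


# Placeholder parameters, not fitted values from table S2
norm, alpha, beta = 3.0, 2.0, 0.2
print(log_parabola(1.0, norm, alpha, beta))   # at E = E_ref the model returns norm
print(local_index(10.0, alpha, beta))         # the spectrum steepens above E_ref
```

In this parameterization a smaller curvature parameter `beta` corresponds to a wider parabola, which is the sense in which the spectra are described as widening over time.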

The similarity between the spectra of the 
Fermi-LAT and H.E.S.S. data and their similar 
decay profiles after their respective peaks indi- 
cate a common origin for the gamma rays from 
1 day to 1 month after the explosion (supple- 
mentary text). We assume that the gamma rays 
are emitted by particles accelerated at the ex- 
ternal shock, as it propagates into the wind of 
the red giant (fig. S2). Optical spectroscopic measurements
of the 2021 nova indicate shock velocities
($u_{\mathrm{sh}}$) ranging from 4000 to 5000 km s$^{-1}$
(11), compatible with measurements from the
previous 2006 outburst of RS Oph (12, 13).
High-resolution images of the 2006 event 
(14) indicated that the polar regions of the 
shock expanded at ~5000 km s$^{-1}$ over the first
5 months. We therefore assume that, during 
the first week after the 2021 outburst, the shock 
velocity did not fall below several thousand 
kilometers per second. 

The images of the 2006 nova showed a quasi- 
spherical outflow, pinched at an equatorial 
ring (14, 15). This is consistent with a shock 
expanding into the wind of the red giant, or- 
thogonal to the orbital plane of the binary but 
inhibited close to the plane by the denser gas 
(7, 16). We expect particles to undergo diffusive 
shock acceleration at the external fast-moving 
shocks, above and below the orbital plane of 
the binary. We consider two scenarios to ex- 
plain the observed spectral and temporal prop- 
erties of the gamma-ray emission from RS Oph: 
(i) gamma-ray production from accelerated
protons colliding with dense gas in the downstream
volume (hadronic $\pi^0$ decay model) and
(ii) gamma-ray production from energetic
electrons scattering low-energy photons from 
the nova (leptonic inverse Compton model). 
For both models, the observations place strong 
constraints on the physical conditions, partic- 
ularly the acceleration efficiencies required 
to match the measured fluxes and maximum 
photon energies (supplementary text). 

VHE gamma-ray emission requires accel- 
eration of particles to energies above the tera- 
electron volt level. The maximum energy a 
particle attains at a shock is determined either 
by the time taken before radiative cooling domi- 
nates over acceleration or by when particles 
become too energetic and thus escape upstream 
of the shock (17). This confinement limit ap- 
plies when the accelerating particles are un- 
able to excite magnetic field fluctuations to a 
sufficient level ahead of the shock. Because par- 
ticles spend more time diffusing upstream than 
downstream, the details of the downstream
magnetic fields can be neglected (18). For upstream
magnetic field amplification to be effective,
a sufficient flux of particles, typically
protons, must escape upstream of the shock. 
This requires both efficient transfer of the shock 
kinetic energy to relativistic protons (i.e., those 
traveling at speeds close to the speed of light) 
and that a fraction of these protons penetrate 
far upstream. Escaping particles have energies 
concentrated close to the maximum particle 
energy; less-energetic particles are confined 
to the shock. The escaping flux per unit area 
at a given shock radius can be parameter- 
ized as 


$$j_{\mathrm{esc}} = e\,\xi_{\mathrm{esc}} F_E / E_{\max} \qquad (1)$$


where $e$ is the elementary charge, $F_E = \tfrac{1}{2}\rho_{\mathrm{up}} u_{\mathrm{sh}}^3$
is the energy flux density for a high-Mach-number
nonrelativistic shock, $\rho_{\mathrm{up}}$ is the immediate
upstream gas density at that radius,
$E_{\max}$ is the maximum particle energy that
dominates the escaping flux, and the efficiency
parameter $\xi_{\mathrm{esc}}$ depends on the assumed
particle spectrum (supplementary text). For a
wind-like density profile, and neglecting radiative
losses, the confinement limit on the
maximum energy for a particle with atomic
number $Z$, and thus charge $|Z|e$, is


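$$E_{\max} = 1.5\,|Z| \left(\frac{\xi_{\mathrm{esc}}}{[\ldots]}\right) \left(\frac{\dot{M}/v_{\mathrm{wind}}}{[\ldots]}\right)^{1/2} \left(\frac{u_{\mathrm{sh}}}{5000\ \mathrm{km\,s^{-1}}}\right)^{2}\ \mathrm{TeV} \qquad (2)$$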


where $\dot{M}$ and $v_{\mathrm{wind}}$ are the mass-loss rate and
the wind velocity, respectively, of the red giant.
$\xi_{\mathrm{esc}}$ is predicted to be ~1% for high-Mach-number
shocks (17). For RS Oph, $\dot{M}/v_{\mathrm{wind}}$ =
6 x 10" kg m$^{-1}$ (15), which, together with the
inferred shock velocities, indicates a maximum
energy $E_{\max}$ ~ 10 TeV. This is compatible
with the measured maximum photon energies
$E_{\gamma,\max} \approx 1$ TeV (Fig. 3).

In the hadronic scenario, the gamma-ray 
light curves are consistent with an expanding 
shock in a decreasing density profile. With the 
adopted distance of 1.4 kpc (4), the measured 
gamma-ray fluxes require that >10% of the 
post-shocked medium’s internal energy goes 
into accelerating protons or other nuclei. The 
delay between the peaks in the Fermi-LAT and 
H.E.S.S. light curves would then reflect the
finite acceleration time of the >1 TeV protons
or, more specifically, the time taken to populate
the high-energy tail of the distribution
(19). A simple calculation of the acceleration
time (supplementary text), on the basis of a 
comparison of the confinement and Hillas 
limits, implies that this should happen on 
the order of days. This is consistent with the 
spectral evolution seen in Fig. 3: a reduction in
the Fermi-LAT flux, accompanied by a hardening
in H.E.S.S. flux and increased $E_{\gamma,\max}$ over
the first few days after the outburst. Attenuation
of gamma-rays, due to the nova’s optical
and infrared photon fields, is minor below 
1 TeV a few hours after the explosion; there- 
fore, attenuation alone cannot account for the 
observed hardening (supplementary text and 
fig. S10). 

For the alternative leptonic scenario, in which 
tera-electron volt gamma-rays are produced 
by VHE electrons, the acceleration needs to 
overcome the strong radiative losses due to 
inverse Compton cooling in the strong photon 
fields of the nova, as well as synchrotron cool- 
ing in the magnetic field in the shock region. 
To achieve this, electrons must accelerate at 
close to the Bohm rate—i.e., the scattering rate 
equal to the rate of gyration in the magnetic 
field (18). Such efficient scattering requires
strong self-generated magnetic fluctuations up- 
stream of the shock, which implies the presence

Fig. 1. RS Oph significance maps. Significance
maps derived from the H.E.S.S. >100-GeV gamma-ray
observations for the early (A) and late (B) phases
of the RS Oph 2021 outburst. $T_0$ is the time of
peak optical emission, modified Julian day 59435.25. 
The dashed white circles indicate the H.E.S.S. point 
spread function (PSF).

Fig. 2. Gamma-ray light curves of RS Oph. Light curves of gamma-ray emission from RS Oph, including
data from Fermi-LAT and H.E.S.S. observations. The H.E.S.S. data (red squares) cover a period of five nights,
after which observations paused for 10 days (shaded gray band) owing to bright moonlight and then
recommenced for a period of 14 days. The H.E.S.S. flux is integrated from 250 GeV to 2.5 TeV, whereas the
Fermi-LAT flux is integrated from 60 MeV to 500 GeV. Fermi-LAT data are shown in 6-hour bins (bright
blue circles) corresponding to the time windows of the H.E.S.S. observations; data collected outside of these
times are shown with semitransparent markers. Error bars denote $1\sigma$ statistical uncertainties. A power-law
model was fitted to the temporal decay after the time of peak flux for both instruments (red and blue
dashed lines, with uncertainties indicated by the shaded regions). The vertical dotted black line indicates $T_0$,
the peak of the outburst in the optical waveband.


of an energetic relativistic hadronic compo- 
nent. In this scenario, the differences between 
the spectral slopes in the Fermi-LAT and H.E.S.S. 
energy ranges are a consequence of the energy- 
dependent cooling rates in time-dependent 
photon fields. Electrons that radiate in the 
VHE band cool on a time scale less than the age 
of the nova remnant at the times the observa- 
tions were taken, whereas lower-energy un- 
cooled electrons accumulate downstream over 
time. The Fermi-LAT light curve in this sce- 
nario then reflects the evolution of the energy 
density of soft-photon targets, whereas the 
H.E.S.S. light curve traces the full radiative 
output of high-energy electrons up to the VHE 
peak (supplementary text). After the peak, owing 
to the rapidly decreasing photon energy den- 
sity, the cooling time increases faster than the 
remnant’s age, and the VHE emitting electrons 
are also slow cooling. 

In this time-dependent numerical single-zone 
model, parameters can be found to approx- 
imately describe the light curves and spectra 
in both leptonic and hadronic scenarios, providing
quantitative estimates for the acceleration
efficiencies at early times $t$ ($t < T_0 + 5$ days).
We are unable to account for the temporal de- 
cay at later times, probably because we do not 
consider the complex internal structure of the 
nova remnant, its nonspherical geometry, or 
the escape of particles from the emission zone. 
Both the leptonic and hadronic models at early 
times are consistent with continuous injec- 
tion of particles following a power law spectrum
in energy E *” with a high-energy cutoff.
To approximately match the observed flux,
the hadronic model requires >10% of the 
shocked gas’s internal energy to be transferred 
to nonthermal protons, whereas the leptonic 
model requires >1% efficiency for nonthermal 
electrons. 

Such a high fraction of the total energy in 
nonthermal electrons is inconsistent with 
theories of injection at high-Mach-number 
shocks, for which the ion injection efficiency 
is expected to be much higher than that of 
electrons (20). Numerical simulations of high- 
Mach-number shocks find a ratio of <$10^{-2}$ for
electron to ion energy densities (21), consistent
with multiwavelength models of supernova 
remnants (22). A >1% efficiency of conver- 
sion to nonthermal electrons cannot be re- 
alized in a purely leptonic model. For this 
reason, we prefer the hadronic scenario dis- 
cussed above, for which both the implied high- 
proton-acceleration efficiencies and inferred 
maximum energy are in line with theoretical 
predictions (17). Our findings support previ- 
ous hadronic models of gamma-ray novae 
(23-25). 

The VHE detection of RS Oph demonstrates 
that particle acceleration to energies around 
the tera-electron volt level can occur within the 
dense wind environments of recurrent novae. 
The total kinetic energy from each nova of 
RS Oph is estimated to be ~10“ erg [10~” solar
masses ($M_\odot$) of ejecta at ~4000 km s$^{-1}$ (13)],
with a large fraction of this being converted to 
relativistic protons and heavier nuclei, which 
are the main constituents of Galactic cosmic
rays.

Fig. 3. RS Oph gamma-ray spectra. H.E.S.S. and
Fermi-LAT spectra for 9 August (green) and
13 August (orange), fitted with a log-parabola model.
The analysis is applied separately for the H.E.S.S.
CT1 to CT4 (squares) and CT5 (triangles) detectors
(8). The Fermi-LAT data (open circles) are
integrated over 24 hours centered at the H.E.S.S.
observation times. There is spectral evolution from
9 to 13 August, with a reduction in the Fermi-LAT
flux and an increase in the maximum energy of the
tera-electron volt spectrum. Error bars denote
$1\sigma$ statistical uncertainty, and upper limits are 95%
confidence levels.

Each nova event generates enough cosmic
rays to fill a cubic parsec volume with an
energy density of ~0.1 eV cm$^{-3}$, similar to the
local Galactic cosmic-ray energy density of
~(0.8 to 1.0) eV cm$^{-3}$ (26) sustained by supernovae.
In the case of RS Oph, the cosmic-ray
energy input recurs approximately every $\Delta t$ =
15 to 20 years, leading to an almost-continuous
injection of nonthermal particles. For a diffusion
coefficient $D$ in the neighborhood of RS Oph,
the cosmic-ray output from each nova is spread
over a diffusion length $l_{\mathrm{diff}} = \sqrt{4D\Delta t}$. Using
a Galactic average $D \sim (3\ \mathrm{to}\ 5) \times 10^{28}$ cm$^2$ s$^{-1}$
(27), we find $l_{\mathrm{diff}} > 1$ pc, and the contribution
from each nova is less than the average Ga- 
lactic cosmic-ray population. If the diffusion 
coefficient in the neighborhood of the nova is 
much lower than the Galactic average, per- 
haps due to enhanced turbulence after pre- 
vious novae, such a sustained source of cosmic 
rays will raise the local abundance. If efficient 
acceleration of particles to tera-electron volt 
energies in recurrent novae is commonplace, 
with spectral energy distribution harder than 
that of the Galactic cosmic-ray background 
$\propto E^{-2.7}$, the local contribution from novae
would dominate at tera-electron volt energies 
over volumes above the cubic parsec level. 
The size of the affected region will depend 
on the value of the diffusion coefficient, which 
can be constrained with measurements of
the diffuse gamma-ray emission at energies
~10 to 100 GeV (28). 
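
The diffusion-length estimate in this paragraph can be reproduced with a few lines of arithmetic. The sketch below assumes a diffusion length of the form sqrt(4 D Δt), a Galactic-average diffusion coefficient of (3 to 5) × 10^28 cm^2 s^-1, and the recurrence interval of 15 to 20 years quoted in the text; the function name and unit-conversion constants are illustrative and approximate.

```python
import math

SECONDS_PER_YEAR = 3.156e7   # approximate length of a year in seconds
CM_PER_PARSEC = 3.086e18     # approximate parsec in centimeters


def diffusion_length_pc(d_cm2_per_s: float, dt_years: float) -> float:
    """Diffusion length sqrt(4 * D * dt), converted from centimeters to parsecs."""
    length_cm = math.sqrt(4.0 * d_cm2_per_s * dt_years * SECONDS_PER_YEAR)
    return length_cm / CM_PER_PARSEC


# Assumed Galactic-average diffusion coefficients and recurrence intervals
for d in (3e28, 5e28):
    for dt in (15.0, 20.0):
        print(f"D = {d:.0e} cm^2/s, dt = {dt:.0f} yr -> {diffusion_length_pc(d, dt):.1f} pc")
```

Every combination gives a length of a few parsecs, in line with the statement that the output of each nova is spread over more than 1 pc.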

Our time-resolved gamma-ray emission mea- 
surements have implications for the origin of 
cosmic rays. Acceleration of cosmic rays to 
peta-electron volt energies in supernova rem- 
nants requires substantial amplification of 
magnetic fields. Fast shocks (~10,000 km s$^{-1}$)
propagating through the dense winds (M / 
Vwind ~ 10 kgm?) associated with the pro- 
genitors of supernova remnants from massive 
($\gtrsim 8\,M_\odot$) stars provide the only known environments
where the required conditions can (in
theory) be met (17, 29). However, observational 
confirmation of this prediction has not been 
found. The detection of VHE gamma-rays from 
RS Oph provides an example of a Galactic ac- 
celerator reaching the theoretical limit for the 
maximum achievable particle energy via dif- 
fusive shock acceleration (17). If our results can 
be extrapolated to the most optimistic super- 
nova conditions, they support the prevailing 
model of Galactic peta-electron volt cosmic 
rays originating in supernova remnants from 
massive stars (17, 29).

PALEONTOLOGY


Brawn before brains in placental mammals after the end-Cretaceous extinction


Ornella C. Bertrand, Sarah L. Shelley, Thomas E. Williamson, John R. Wible,
Stephen G. B. Chester, John J. Flynn, Luke T. Holbrook, Tyler R. Lyson,
Jin Meng, Ian M. Miller, Hans P. Püschel, Thierry Smith, Michelle Spaulding,
Z. Jack Tseng, Stephen L. Brusatte


Mammals are the most encephalized vertebrates, with the largest brains relative to body size. Placental 
mammals have particularly enlarged brains, with expanded neocortices for sensory integration, the 
origins of which are unclear. We used computed tomography scans of newly discovered Paleocene fossils 
to show that contrary to the convention that mammal brains have steadily enlarged over time, early 
placentals initially decreased their relative brain sizes because body mass increased at a faster rate. 
Later in the Eocene, multiple crown lineages independently acquired highly encephalized brains 
through marked growth in sensory regions. We argue that the placental radiation initially emphasized 
increases in body size as extinction survivors filled vacant niches. Brains eventually became larger
as ecosystems saturated and competition intensified.


Mammals have the largest brains, absolutely
and relative to body size, of all
vertebrates, reaching an extreme in the
hyperinflated brain of humans (1, 2).
The mammalian brain was assembled
over 200 million-plus years of evolution, beginning
with encephalization pulses (increases
in brain size relative to body size) on the
mammal stem lineage in the Mesozoic. These
increases were associated with heightened
olfaction and the origin of the neocortex, a 
new part of the cerebrum involved in higher 
cognition and sensory integration (3). Among 
extant mammals, the more than 6000 species 
of placentals (4) exhibit remarkable diversity 
of encephalization (5) and sensory abilities 
(6), the result of complex interplay between 
brain and body size changes over time and
across taxa (7). It has long been recognized
that the earliest fossils of extant placental 
orders (herein “placental crown orders”) from 
the Eocene (56 million to 34 million years ago) 
had brains similar in structure to (8, 9), but 
smaller than, those of their modern-day counterparts
(10, 11) and that relative brain size
generally increased over the Cenozoic (12, 13).
However, much about the origin of the placen- 
tal brain, and when and how it encephalized to 
modern levels, remains unclear. 

In particular, little is known about the tran- 
sition from the ancestral brains of Mesozoic 
mammals to the more modern brains of 
Eocene crown placentals. The principal gap in 
understanding is the Paleocene (66 million to 
56 million years ago), the interval after the 
end-Cretaceous mass extinction, when placen- 
tals and close kin radiated into niches vacated 
by dinosaurs (14), ballooned in body size 
(15, 16), and inaugurated the Age of Mammals 
(17, 18). Jerison (1) posited that the “archaic” 
placentals replacing dinosaurs (herein “pla- 
cental stem taxa,” those that do not clearly 
belong to extant orders) were overgrown ver- 
sions of Mesozoic mammals, which passively 
increased their absolute brain sizes to keep 
pace with body size expansion and then, later 
in the Eocene, actively encephalized to increase 
relative brain sizes. This hypothesis was based 
on a small sample of skulls, of uncertain 
Paleocene or Eocene age, whose brain cavities 
were measured with basic volumetric tech- 
niques unable to distinguish sensory regions. 
Recently, a broad study of mammals instead 
identified increases in relative brain size after 
the end-Cretaceous extinction (7), but this 
alternative hypothesis was inferred from 
phylogenetic comparative data that did not 
directly include Paleocene fossils. Testing these
competing hypotheses has proven difficult because
well-preserved Paleocene mammal skulls
have been notoriously rare. 

We assessed the early evolution of the pla- 
cental brain with an expansive dataset of 
Mesozoic and Cenozoic mammals, including 
newly discovered Paleocene skulls from the 
San Juan Basin of New Mexico (19) and Denver 
Basin of Colorado (20). To do so, we used high- 
resolution computed tomography (CT) to mea- 
sure the size of the brain and its individual 
sensory components (neocortex, olfactory 
bulbs, and cerebellar petrosal lobules involved 
in control of eye movements) (Fig. 1) (21). Our
dataset includes 34 new CT scans of Paleogene 
taxa (17 Paleocene and 17 Eocene) (table S1), 
alongside data from previous work. We examined
patterns over time and across taxa to quantify
how body size, brain size, and sensory regions
changed during the placental radiation and
after the end-Cretaceous extinction. This en- 
abled us to test the competing hypotheses for 
placental encephalization and more broadly 
to investigate when and how the modern pla- 
cental brain and sensory repertoire emerged, 
and what role the end-Cretaceous extinction 
played. 

We found that Mesozoic and Paleocene 
mammals had smaller brains relative to their 
body mass than those of Eocene stem taxa and 
crown orders (Fig. 1D). When expressed as a 
phylogenetic encephalization quotient (PEQ), 
a measure of relative brain size compared with 
its value predicted by allometry and phylogeny 
(22), Paleocene forms were less encephalized 
than Mesozoic and Eocene taxa (Fig. 2, A and 
B, and table S2). After the Paleocene low in
PEQ, both average PEQ and variance signif- 
icantly expanded in the Eocene, in both stem 
taxa and crown orders (Fig. 2, A and B). These 
results are robust to phylogenetic uncertainty 
(figs. S1 and S2 and tables S3 to S5), differences 
in body mass estimates (figs. S3 to S9 and 
tables S6 to S8), and whether placentals orig- 
inated in the Cretaceous or immediately after 
the end-Cretaceous extinction (fig. S10). Our
results are broadly in line with Jerison’s “body- 
before-brain” hypothesis rather than the alter- 
native but reveal an unexpected wrinkle. The 
earliest placentals were neither scaled up ver- 
sions of Mesozoic mammals nor did they en- 
cephalize rapidly. Instead, their relative brain 
sizes actually decreased as they increased in 
body sizes more than in brain size while ra- 
diating after the end-Cretaceous extinction. 
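
The logic of an encephalization quotient can be illustrated with a simplified calculation. The sketch below is not the PEQ of the study, which uses a phylogenetically informed regression (22); it substitutes an ordinary least-squares allometric fit as a stand-in, and the reference sample, coefficients, and function names are all hypothetical. It shows why a lineage whose body mass grows faster than its brain volume ends up with a lower quotient even though its brain is larger in absolute terms.

```python
import math


def fit_allometry(body_masses_g, brain_volumes_cm3):
    """Ordinary least-squares fit of log10(volume) = a + b * log10(mass)."""
    xs = [math.log10(m) for m in body_masses_g]
    ys = [math.log10(v) for v in brain_volumes_cm3]
    n = len(xs)
    x_mean, y_mean = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    b = sxy / sxx
    a = y_mean - b * x_mean
    return a, b


def encephalization_quotient(mass_g, volume_cm3, a, b):
    """Observed brain volume divided by the volume predicted by the allometric fit."""
    return volume_cm3 / 10 ** (a + b * math.log10(mass_g))


# Hypothetical reference sample lying exactly on volume = 0.05 * mass ** 0.7
masses = [10.0, 100.0, 1000.0, 10000.0]
volumes = [0.05 * m ** 0.7 for m in masses]
a, b = fit_allometry(masses, volumes)

ancestral_volume = 0.05 * 1000.0 ** 0.7
eq_before = encephalization_quotient(1000.0, ancestral_volume, a, b)
# Descendant: body mass grows 8-fold while brain volume only doubles
eq_after = encephalization_quotient(8000.0, 2.0 * ancestral_volume, a, b)
print(f"slope b = {b:.2f}, EQ before = {eq_before:.2f}, EQ after = {eq_after:.2f}")
```

In this toy case the brain doubles in absolute volume, yet the quotient falls below 1 because the expected brain volume for the larger body rises by more than a factor of two.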

Proportions of sensory regions also changed 
markedly over time. The olfactory bulbs and 
petrosal lobules of Paleocene species and 
Eocene placental stem taxa essentially main- 
tained their Mesozoic sizes, relative to both 
body mass and brain volume (endocranial 
volume) (Fig. 3, fig. S11, and table S2). The
olfactory bulbs became a significantly smaller 
component of the brain, and the petrosal
lobules became a larger component, in Eocene
crown orders (Fig. 3, fig. S11, and table S2). 
Additionally, the neocortex enlarged over 
time and covered significantly more of the 
brain surface in Eocene crown orders versus 
Paleocene and Eocene stem forms (Fig. 3 and 
table S2). Thus, the Eocene increase in relative 
brain size was underpinned by expansions 
of the petrosal lobules and especially the neo- 
cortex, but not the olfactory bulbs. Most of 
these changes occurred in the Eocene crown 
orders, helping explain why they had sig- 
nificantly higher PEQs than those of their 
Eocene stem contemporaries, despite insig- 
nificant differences in body mass among them 
(Fig. 2 and table S2). These results are also
robust to phylogenetic uncertainty and pla- 
cental origin timing (figs. S12 to S17 and 
tables S3 to S5). 

Phylogenetic context helps untangle the 
trends and tempo of these changes. Both body 
mass and brain volume significantly increased 
from the Mesozoic to the Paleocene (Fig. 2, D 
and F, and table S9), as did the rate of change 
in both measures on the phylogeny (figs. S18 
and S19), regardless of placental origin time 
(fig. S20). However, during the Paleocene, the 
majority of branches exhibited faster rates of 
body mass increase than brain volume increase 
(fig. S21), and although this largely held in the
Eocene, crown orders accelerated to faster rela- 
tive increases in brain volume rate compared 
with those of Eocene stem and Paleocene taxa. 
This discrepancy explains why there was a 
drop in PEQ in the Paleocene, followed by 
relative brain expansion in the Eocene. 

Rates of change in body mass, brain vol- 
ume, and PEQ were stable across most of 
the Mesozoic but increased dramatically at 
or near the origin of Placentalia, whether it 
occurred in the Cretaceous (Fig. 4B and figs. 
S18 and S19) or immediately after the end-
Cretaceous extinction (figs. S20 and S22). This 
makes it difficult to differentiate the effects of 
the extinction itself, but because both dating 
approaches place the key pulses at or near the 
origin of Placentalia, this implies that funda- 
mental changes to the mammalian brain and 
its allometric relationships happened around 
this event, perhaps associated with changes in 
metabolic rate and reproductive style (23). The 
body mass, brain volume, and PEQ rate in- 
creases for early Placentalia were not followed 
by stasis or decline; rather, high rates continued 
through the Paleocene and Eocene. Brain 
evolutionary rate increased in placentals, al- 
though it would take until the Eocene for those 
persistently high rates to assemble greatly 
encephalized modern placental brains with 
relatively large neocortices and petrosal lobules 
and small olfactory bulbs (figs. S23 to S26). 

Phylogenetic character mapping reveals 
lineage-specific changes in body, brain, and 
sensory region sizes. The Paleocene decline

Fig. 1. Virtual endocasts of Cenozoic mammal exemplars used in this study with sensory regions highlighted and phylogenetic generalized least squares (PGLS)
regressions for Mesozoic and early Cenozoic mammals. (A) Late Paleocene stem placental condylarth Arctocyon primaevus (IRSNB M2332). (B) Middle Eocene stem
placental tillodont Trogosus hillsii (USNM 17157). (C) Middle Eocene crown placental perissodactyl Hyrachyus modestus (AMNH FM 12664). (D) Phylogenetically corrected PGLS
regression of log10(Endocranial volume) versus log10(Body mass). (E) Phylogenetically corrected PGLS regression of PEQ versus log10(Body mass) and graphical summary of the
results. The petrosal lobules are absent in (B) and (C). Scale bar, 10 mm. Body mass was measured in grams, and endocranial volume was measured in cubic centimeters.

Fig. 2. PEQ, endocranial volume, and body mass of Mesozoic and early Cenozoic mammals. (A) Boxplot of PEQ for Mesozoic, Paleocene, Eocene stem taxa, and
Eocene crown orders. (B) PEQ average through time per 10-million-year bins. (C) Boxplot of log10(Body mass). (E) Boxplot of log10(Endocranial volume) for groups in (A).
(D) Log10(Body mass) average through time per 10-million-year bins for groups in (A). (F) Log10(Endocranial volume) average through time per 10-million-year bins.
Body mass was measured in grams, and endocranial volume was measured in cubic centimeters.

in PEQ was the net result of several independent
decreases in relative brain volume, largely
because of independent increases in body 
mass in archaic groups such as “condylarths,” 
pantodonts, and taeniodonts (Fig. 4 and figs. 
S18 to S20, S22, and S27 to S29). Then in the
Eocene, there were several independent in- 
creases in PEQ, particularly in crown orders 
such as artiodactyls (including cetaceans), 
perissodactyls, carnivorans, euprimates, and 
rodents (Fig. 4 and figs. S18 to S20, S22, and
S27 to S29). Omnivores and carnivores were
statistically indistinguishable in PEQ com- 
pared with contemporary herbivores in the 
Paleocene, but both guilds significantly in- 
creased PEQ in the Eocene, with omnivores 
and carnivores eclipsing herbivores (fig. 
S30 and table S10). The two most speciose 
placental subclades experienced different 
fates: Raw body mass increased markedly in 
Paleocene-Eocene laurasiatherians (carnivorans, 
perissodactyls, and artiodactyls) but not in
euarchontoglirans (rodents and primates).
Both subclades underwent substantial enceph- 
alization (Fig. 4), but rates of change in both 
body mass and brain volume were higher in 
laurasiatherians (Fig. 4 and figs. S18 to S20 
and S22). 
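
The figure captions refer to PGLS regressions of log10(endocranial volume) on log10(body mass). As a rough illustration of what such a fit involves, the Python sketch below solves a generalized least squares problem in which the error covariance matrix stands in for shared phylogenetic history. The data, the covariance matrix, and the function name `gls_fit` are hypothetical, and the observed-to-expected ratio at the end is only one common way to express relative brain size; this excerpt does not state exactly how PEQ was computed.

```python
import numpy as np

def gls_fit(x, y, cov):
    """Generalized least squares fit of y = a + b*x with error covariance `cov`.

    Returns (beta, residuals), where beta = [intercept, slope].
    """
    X = np.column_stack([np.ones_like(x), x])
    cov_inv = np.linalg.inv(cov)
    # GLS normal equations: (X' C^-1 X) beta = X' C^-1 y
    beta = np.linalg.solve(X.T @ cov_inv @ X, X.T @ cov_inv @ y)
    residuals = y - X @ beta
    return beta, residuals

# Hypothetical log10 body mass (g) and log10 endocranial volume (cm^3) for four taxa.
log_mass = np.array([2.0, 3.0, 4.0, 5.0])
log_vol = np.array([0.1, 0.7, 1.5, 2.0])

# Hypothetical covariance: taxa 0/1 and taxa 2/3 are sister pairs that share
# half of their evolutionary history (an assumption for illustration only).
cov = np.array([
    [1.0, 0.5, 0.0, 0.0],
    [0.5, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.5],
    [0.0, 0.0, 0.5, 1.0],
])

beta, resid = gls_fit(log_mass, log_vol, cov)

# Observed / expected volume on the raw scale; values above 1 indicate a
# brain larger than the regression predicts for that body mass.
relative_size = 10.0 ** resid
print("intercept, slope:", beta)
print("observed/expected:", relative_size)
```

With an identity covariance matrix the same function reduces to ordinary least squares, which is a convenient sanity check before supplying a tree-derived covariance.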

Fig. 3. Relative size of the olfactory bulbs, petrosal lobules, and neocortex
of Mesozoic and early Cenozoic mammals. (A) Boxplot of the residuals from a
PGLS regression of log10(Olfactory bulb volume versus Endocranial volume) for
Mesozoic, Paleocene, Eocene stem taxa, and Eocene crown orders. (B) Boxplot of
the residuals from a PGLS regression of log10(Petrosal lobule volume versus
Endocranial volume) for the groups in (A). (C) Boxplot of the residuals from a
PGLS regression of log10(Neocortical surface area versus Endocranial surface
area) for the groups in (A). (D) Residuals of log10(Olfactory bulb volume versus
Endocranial volume) average through time per 10-million-year bins. (E) Residuals
of log10(Petrosal lobule volume versus Endocranial volume) average through time
per 10-million-year bins. (F) Residuals of log10(Neocortical surface area versus
Endocranial surface area) average through time per 5-million-year bins. (G) PGLS
regression of log10(Olfactory bulb volume versus Endocranial volume) for the
groups in (A). (H) PGLS regression of log10(Petrosal lobule volume versus
Endocranial volume) for the groups in (A). (I) PGLS regression of log10
(Neocortical surface area versus Endocranial surface area) for the groups in (A).
Mesozoic neocortical data were unavailable. Volumetric measurements are in
cubic millimeters, and surface area measurements are in square millimeters.

Fig. 4. PEQ evolutionary change across time and phylogeny in Mesozoic
and early Cenozoic mammals. (A) PEQ density plot and ancestral state
[...] Eocene mammals. (B) PEQ evolutionary rate mapped onto a phylogenetic
tree and averaged through time per 10-million-year bins. Virtual endocasts
illustrated to the right with highlighted brain regions are scaled in terms
of their PEQ and are, from top to bottom, Arctocyon primaevus (IRSNB
M2332), Hyopsodus paulus (USNM 17980), Meniscotherium chamense
(USNM 22673), Trogosus hillsii (USNM 17157), Acmeodon secans (KU 7912),

Small relative brain size after the end-Cretaceous
extinction, and even more so a decline
in encephalization, is surprising and counter
to the convention that mammal brains have
steadily gotten larger over time [(10-13), but
see (7)]. Although relative brain size is but
one aspect of cognition, and we lack infor- 
mation on neuron density and connectivity 
in virtually all fossil mammals, experimental 
evidence indicates that extant mammals with 
relatively larger brains are better problem 
solvers (24). There also is ample evidence that 
larger relative brain size is linked to greater 
behavioral flexibility and capacity to cope with 
new or altered environments (25-28). All of 
these skills, presumably, would have been 
beneficial after a mass extinction, yet the 
placentals that diversified as devastated eco- 
systems recovered and new food webs emerged 
did not develop relatively larger brains, which 
suggests that other factors were behind their 
radiation. 

It appears that the postextinction placental 
radiation emphasized changes to body size 
rather than brain size, as survivors proliferated 
to fill vacant niches, no longer constrained to 
shrew to badger sizes by incumbent dinosaurs 
(15). Brain size increased as a result of body 
size expansion, and because there were no 
marked relative increases in key sensory re- 
gions (olfactory bulbs and petrosal lobules), 
the growth was primarily in regions that 
permit control of larger bodies (brain stem, 
diencephalon, and striatum) (29, 30). This 
suggests that selection acted differently on 
brain and body size and adds to growing 
evidence that body size shifts, rather than 
pronounced alterations in brain size, drove 
much of the variation in mammalian encephalization
(7, 31). At the very least, relatively
enlarged brains were not necessary for Paleo- 
cene placentals. 

As the Paleocene transitioned to the Eocene, 
the mode of placental brain and body evolu- 
tion shifted. Relative brain size greatly increased, 
in both average and variance, signaling a new 
regime in which brain size changes were more 
paramount than those in body size. Brains 
were not only becoming larger; growth was 
focused in regions involved in advanced senses 
as these mammals added better balance, vision, 
eye movement, head control, and sensory in- 
tegration to their preexisting keen olfaction, 
thus greatly expanding the placental sensory
toolkit. This corroborates arguments that
mosaic evolution of brain regions, not merely
changes in overall size, underlies the adaptive
potential of the mammalian brain, not only
across phylogeny (32) but also in deep time.
Primarily, encephalization and sensory en- 
hancements occurred in the crown orders, 
with predators developing significantly larger
relative brain sizes than those of prey groups
as ecosystems saturated. It was these crown 
placentals—among them the first horses, whales, 
dogs, bats, and euprimates—that achieved high 
PEQs, whereas the smaller-brained and more 
olfactory-driven stem taxa waned, with groups 
once so instrumental in the end-Cretaceous 
recovery and initial placental radiation—such 
as condylarths and pantodonts—ultimately 
succumbing to extinction.

CHRONIC PAIN

A spinal microglia population involved in remitting and relapsing neuropathic pain


Keita Kohno, Ryoji Shirasaka, Kohei Yoshihara, Satsuki Mikuriya, Kaori Tanaka,
Keiko Takanami, Kazuhide Inoue, Hirotaka Sakamoto, Yasuyuki Ohkawa,
Takahiro Masuda, Makoto Tsuda


Neuropathic pain is often caused by injury and diseases that affect the somatosensory system. Although
pain development has been well studied, pain recovery mechanisms remain largely unknown. Here,
we found that CD11c-expressing spinal microglia appear after the development of behavioral pain
hypersensitivity following nerve injury. Nerve-injured mice with spinal CD11c* microglial depletion failed
to recover spontaneously from this hypersensitivity. CD11c* microglia expressed insulin-like growth
factor-1 (IGF1), and interference with IGF1 signaling recapitulated the impairment in pain recovery. In
pain-recovered mice, the depletion of CD11c* microglia or the interruption of IGF1 signaling resulted in a
relapse in pain hypersensitivity. Our findings reveal a mechanism for the remission and recurrence of
neuropathic pain, providing potential targets for therapeutic strategies.


Fig. 1. CD11c* spinal microglia are necessary for the remission of pain
hypersensitivity after PNI. (A) Venus fluorescence in the SDH of Itgax-Venus
mice 14 days after PNI. (B to E) Venus and cell-type marker immunostaining: (B)
IBA1/CD11b, myeloid markers including microglia; (C) glial fibrillary acidic protein
(GFAP), an astrocyte marker; (D) adenomatous polyposis coli (APC), an
oligodendrocyte marker; and (E) neuronal nuclei (NeuN), a neuron marker. (F) Venus
and tdT fluorescence in the SDH of Hexb^tdT;Itgax-Venus mice on day 14.
Shown is the percentage of tdT* cells per total Venus* cells (n = 4 mice). (G to
J) Temporal changes in the number of CD11c^neg (H), CD11c* (I), and CD11c^high (J)
spinal microglia after PNI (n = 3 to 5 mice) (flow cytometry). (K and L) EGFP in the
SDH of Itgax-DTR-EGFP mice 2 days after intrathecal injection of PBS or DTX
(0.5 ng) on day 14 (K). Also shown is flow cytometric quantification of CD11c* (EGFP)
microglia (n = 5 mice) (L). (M) PWT of Itgax-DTR-EGFP mice injected with PBS or
DTX before (Pre) and after PNI (n = 7 to 8 mice). Scale bars, 200 µm [(A) and (K)]
and 20 µm (B). Data are shown as means ± SEM. **P < 0.01 and ****P < 0.0001.
For more details, see the extended caption in the supplementary materials.

Damage to the nervous system alters the
somatosensory system, which can result
in the development of neuropathic pain.
These alterations occur in neurons and
in non-neuronal cells, especially microglia,
a type of resident immune cell. In response
to peripheral nerve injury (PNI, a model of
neuropathic pain), microglia in the spinal
dorsal horn (SDH) respond rapidly, with al- 
terations in morphology, cell number, and 
gene expression levels, processes that are necessary
for pain development (1, 2). However,
microglia are highly plastic cells that exist in 
heterogeneous states characterized by distinct
gene expression, and microglial populations
can change during the course of neuronal de- 
velopment and disease (3-5). 

Here, we focused on microglia characterized
by the expression of integrin αX (ITGAX, also
known as CD11c) (6) because CD11c* microglia
in the brain increase in several neurological
disease models (7). By using Itgax-Venus mice,
in which CD11c* cells express the fluorescent
protein Venus, we found that Venus* cells in
the SDH were rarely observed in naive and
sham-operated mice (fig. S1A) but increased
after PNI (Fig. 1A and fig. S1B). Venus expression
was highly restricted to cells with ionized
calcium-binding adapter molecule 1 (IBA1)
and CD11b (Fig. 1, B to E), the purinergic
receptor P2Y12R (3, 7) (fig. S1C), and tandem
Tomato (tdT) (Fig. 1F) in Hexb^tdT/tdT mice,
which enables the visualization of microglia
(8). Venus* cells were observed in a portion of
tdT* (Fig. 1F) and P2Y12R* microglia (fig. S1C).
In addition, data from bone marrow-chimera
mice (fig. S1, D and E) and splenectomized
mice (fig. S1F) excluded an involvement of
peripherally derived cells in the appearance
of Venus* cells in the SDH after PNI. Thus,
we hereafter refer to Venus* SDH cells as
CD11c* microglia. Flow cytometry analyses
(Fig. 1, G to J) revealed that CD11c* microglia
peaked at day 14 after PNI and were observed
until the last time point tested (day 56) (Fig.
1I), a temporal pattern that was in contrast
with CD11c^neg microglia (Fig. 1H). Moreover,
CD11c^high microglia (Fig. 1G and fig. S1G)
peaked on day 35 after PNI (Fig. 1J), a time
point at which the paw withdrawal threshold
(PWT) to mechanical stimuli returned to the
basal level (fig. S1H). Thus, CD11c* microglia
appeared in the SDH after the development
of PNI-induced pain hypersensitivity and
remained even after the hypersensitivity had
remitted.

Fig. 2. IGF1 is a CD11c* microglial factor required for recovery from pain
hypersensitivity. (A) Multidimensional scaling plot representing relationships
between CD11c^neg and CD11c^high microglia (n = 4 to 5 mice). Each point represents a
single sample. (B) Heatmap of the top 200 DEGs in microglia for the different
conditions. Each row represents a single sample. DEGs are listed in table S1.
(C) Volcano plot showing the average log2-fold change and -log10 false discovery
rate (FDR) for all DEGs between CD11c^neg and CD11c^high microglia on day 35. DEGs
are listed in table S2. (D) qPCR analysis of Igf1 mRNA in FACS-sorted CD11c^neg
and CD11c^high SDH microglia before (Naive) and after PNI (n = 4 mice). (E) In
situ hybridization of mRNAs of Venus, Igf1, and P2ry12 in the SDH on day 35.
Scale bar, 20 µm. (F to H) PWTs of Itgax-Cre;Igf1^flox/flox mice (n = 5 mice) (F),
Cx3cr1^CreERT2;Igf1^flox/flox mice (n = 8 or 9 mice) (G), and IGF1-neutralizing
antibody (anti-IGF1#1)-treated WT mice (n = 6 to 7 mice) (H). (I) Reversal of PNI-
induced pain hypersensitivity by intrathecal injection of IGF1 (n = 6 to 8 mice).
Data are shown as mean ± SEM. *P < 0.05, **P < 0.01, ***P < 0.001, and ****P <
0.0001. For more details, see the extended caption in the supplementary materials.

Fig. 3. Appearance of CD11c* SDH microglia involves AXL. (A) Venus in the SDH of
Itgax-Venus mice on day 14 after injection of CTB-saporin or IB4-saporin. (B) CD11c*
microglia after saporin injection (n = 3 to 4 mice) (flow cytometry). (C) Phagocytosis of
myelin debris by cultured spinal microglia from Itgax-Venus mice with PNI (day 14).
Percentage of fluoromyelin* microglia (n = 7 wells). (D) IBA1, CD68, and MBP and Venus
in the SDH on day 14. IBA1* cells including multiple MBP signals within CD68* lysosomes
were counted (n = 4 to 5 mice). (E) Electron microscopy image of Venus (yellow) and
myelin particles (blue) in the SDH (day 14). (F) Venus in the SDH on days 14 (left)
or 7 (right) after intraspinal injection of myelin debris. (G) Axl mRNA in sorted
CD11c^neg and CD11c^high SDH microglia after PNI (n = 4 mice) (qPCR). (H and I)
Representative images for AXL and Venus in the SDH on day 14 (arrowheads:
AXL*Venus* microglia). Percentage of AXL*CD11c* microglia (n = 3 to 5 mice).
(J) Total (left) and CD11c^high microglia (right) in the SDH of Axl^-/- mice on
day 35 (n = 10 or 11 mice). (K) Igf1 mRNA in sorted microglia in the
ipsilateral SDH of Axl^-/- mice on day 35 (n = 7 mice) (qPCR).
(L) PWT of Axl^-/- mice (n =

0 mice). Data are shown as 
mean ± SEM. *P < 0.05, **P < 0.01, ***P < 0.001, and ****P <
0.0001. Scale bars, 200 µm [(A) and (F), left], 20 µm [(F),
right, and (H)], 10 µm (D), 1 µm [(E), left], and 200 nm [(E),
right]. For more details, see the
extended caption in the supplementary materials.

To test the role of CD11c* microglia, we used
a cell-depletion strategy in Itgax-DTR-EGFP
mice, in which CD11c* cells expressed diphtheria
toxin receptors (DTRs) and enhanced green
fluorescent protein (EGFP) (fig. S2). Intrathecal
injection of diphtheria toxin (DTX) depleted
EGFP* cells (CD11c* microglia) (Fig. 1, K and L,
and fig. S2). In a behavioral test 14 days after
PNI, phosphate-buffered saline (PBS)-treated
Itgax-DTR-EGFP mice (Fig. 1M) and DTX-treated
wild-type (WT) mice (fig. S3) displayed
markedly decreased PWTs, which gradually
increased. Mice with CD11c* spinal microglia
depletion from day 14 failed to display spontaneous
recovery from pain hypersensitivity
(Fig. 1M and fig. S4). By contrast, CD11c*
microglial depletion on day 7 had no effect
(fig. S5). In addition, splenectomy did not
change the PNI-induced pain (fig. S6). Thus,
CD11c* microglia played a role in the remission
of neuropathic pain.

RNA sequencing of SDH microglia of Itgax-
Venus mice with or without PNI revealed a
distinct transcriptional signature of CD11c^neg
and CD11c^high microglia (Fig. 2, A and B, and
table S1). The differentially expressed genes
(DEGs) in CD11c^high SDH microglia on day 35
after PNI included insulin-like growth factor 1
(Igf1) (Fig. 2C and table S2). Quantitative
polymerase chain reaction (qPCR) analysis confirmed
that Igf1 mRNA was increased in CD11c*
microglia after PNI and peaked on day 35 (Fig.
2D and fig. S7). Igf1 mRNA was detected in
CD11c* SDH microglia by in situ hybridization
(Fig. 2E). To determine the role of IGF1, we
used Itgax-Cre;Igf1^flox/flox mice to knock out
IGF1 in CD11c* cells. After PNI, these mice also
showed no recovery from pain hypersensitivity
(Fig. 2F). Similar results were obtained from
tamoxifen-treated Cx3cr1^CreERT2;Igf1^flox/flox mice
(Fig. 2G), enabling deletion of the Igf1 gene in
CX3CR1* cells (9), including all microglia and
central nervous system (CNS)-associated macrophages
(fig. S8) (10). The repeated intrathecal
administration of an IGF1-neutralizing
antibody, anti-IGF1#1, to PNI mice prevented
pain recovery (Fig. 2H). A similar effect was
obtained using another IGF1-neutralizing
antibody, anti-IGF1#2 (fig. S9). The repeated
intrathecal administration of recombinant
IGF1 to PNI mice accelerated pain recovery
(Fig. 2I). IGF1 was also observed in neurons
(fig. S10, A to F), but Igf1 knockdown in SDH
neurons did not change PNI-induced hypersensitivity
(fig. S10, G and H). Thus, IGF1
derived from CD11c* microglia was necessary
for the spontaneous remission of neuropathic
pain.

From the data showing no increase in CD11c*
SDH microglia in a model of inflammatory
pain (fig. S11, A and B), the mechanism underlying
the PNI-induced appearance of CD11c*
microglia was predicted to involve an alteration
specific to PNI. CD11c* SDH microglia
increased after the intraplantar injection of
saporin conjugated with cholera toxin B (CTB)
but not with isolectin B4 (IB4) (Fig. 3, A and
B), which kill myelinated and unmyelinated
primary afferents, respectively (11, 12). PNI did
not induce proliferation of existing CD11c*
microglia (fig. S11C). These results, together
with the fact that CD11c* SDH microglia expressed
genes associated with phagocytosis
(table S1), raised the possibility that the appearance
of CD11c* SDH microglia after PNI
could be related to the phagocytosis of myelin
from damaged primary afferents. Primary cultured
CD11c* spinal microglia from adult Itgax-
Venus mice after PNI showed the increased
uptake of myelin debris (Fig. 3C). In the SDH,
the number of CD11c* microglia with myelin
basic protein (MBP) in their lysosomes (CD68*)
increased after PNI (Fig. 3D). Electron microscopy
revealed CD11c* SDH microglia containing
phagocytosed myelin particles after PNI (Fig. 3E).
The intraspinal injection of purified myelin
debris into normal Itgax-Venus mice induced
the engulfment of myelin debris and caused
microglia to become CD11c* (Fig. 3F). Furthermore,
after PNI, CD11c^high SDH microglia
expressed Axl mRNA (Figs. 2, B and C, and
3G), a member of the TAM (TYRO3, AXL, and
MERTK) family of tyrosine kinase receptors
implicated in the engulfment of myelin debris
(13-15). Mertk but not Tyro3 mRNA was also
expressed, but its level was comparable between
CD11c^neg and CD11c^high microglia and
was not changed after PNI (fig. S12A). AXL
protein was expressed in CD11c* SDH microglia
in PNI mice (Fig. 3, H and I). Moreover,
Axl^-/- mice showed a reduction of the PNI-
induced increase in CD11c^high SDH microglia
(Fig. 3J) and of the Igf1 mRNA up-regulation
in microglia (Fig. 3K), along with prolonged
pain hypersensitivity (Fig. 3L). In sorted SDH
microglia, the peak expression of Axl matched
that of Itgax and preceded that of Igf1 mRNA
(fig. S12). Thus, AXL participated in the PNI-
induced appearance of CD11c^high SDH microglia,
presumably through the phagocytosis of
myelin debris.

Fig. 4. Depleting CD11c* microglia relapses PNI-induced pain hypersensitivity.
(A to C) Recurred pain hypersensitivity in mice with depletion of CD11c*
microglia (A), with microglia-selective IGF1-knockout (B), and with spinal IGF1
neutralization (C). (A) DTX or PBS was intrathecally injected into Itgax-DTR-EGFP
mice at days 35 and 42 after PNI (n = 7 mice). (B) Tamoxifen (4 mg) or corn
oil (control) was subcutaneously injected into Cx3cr1^CreERT2;Igf1^flox/flox mice at
days 37 and 39 after PNI (n = 7 mice). (C) Anti-IGF1#1 was intrathecally injected
into WT mice from day 35 after PNI (n = 6 mice). Data are shown as mean ± SEM.
*P < 0.01, ***P < 0.001, and ****P < 0.0001. For more details, see the
extended caption in the supplementary materials.

CD11c* microglia remained in the SDH for
extended periods of time, even after the apparent
behavioral signs of pain hypersensitivity
disappeared. Depleting CD11c* microglia by
intrathecal injection of DTX on day 35 (fig.
S13A), when PWT had returned to the basal
level, resulted in a relapse in pain hypersensitivity
(Fig. 4A). The role of CD11c* SDH microglia
in relapsing pain was supported by the
data showing no effect of DTX on PWT in WT
mice (fig. S13B) and no effect of deletion of other
CD11c* cell types in the periphery (fig. S14). A
similar recurrence of pain hypersensitivity was
observed in Cx3cr1^CreERT2;Igf1^flox/flox mice when
tamoxifen was administered at days 37 and 39
after PNI (Fig. 4B), a treatment that efficiently
knocked down Igf1 mRNA expression in SDH
microglia (fig. S15). Moreover, intrathecal
administration of the neutralizing antibody anti-
IGF1#1 (Fig. 4C) or the IGF1 receptor inhibitor
JB-1 (fig. S16) also caused the reemergence
of hypersensitivity. Thus, CD11c* microglia-
derived IGF1 in the spinal cord repressed the
manifestation of a neuropathic, hypersensitive
state in an ongoing fashion. In addition, there
were no sex differences in the appearance of
CD11c* microglia in the SDH after PNI or in
their role in neuropathic pain (fig. S17).

This study revealed that CD11c* microglia 
are essential for the remission and recurrence 
of neuropathic pain. This implies that, although 
mice appeared to exhibit normal behavioral 
responses several weeks after PNI, this was 
not due to the normalization of a patholog- 
ically altered hypersensitive state but rather to 
the dynamic equilibrium between the ongoing 
pain facilitation and repression, the latter of 
which primarily involves IGF1-expressing 
CD11c* microglia. Shifting microglia toward
this state partly involved AXL and myelin
phagocytosis. Pain repression by spinal IGF1 
might be associated with its reported pleio- 
tropic effects, including the modulation of 
neuronal activity (16) and astrocytic and mi- 
croglial inflammatory responses (17). CD11c* 
microglia have also been shown to increase 
in the brain in neurological disease models 
(7, 18). Thus, our findings could yield an 
avenue for the development of therapeutic 
strategies to treat neuropathic pain and other 
pathological symptoms.

MEMBRANES


An integrated materials approach to ultrapermeable
and ultraselective CO2 polymer membranes


Marius Sandru, Eugenia M. Sandru, Wade F. Ingram, Jing Deng, Per M. Stenstad,
Liyuan Deng, Richard J. Spontak


Advances in membrane technologies that combine greatly improved carbon dioxide (CO2) separation
efficacy with low costs, facile fabrication, feasible upscaling, and mechanical robustness are needed
to help mitigate global climate change. We introduce a hybrid-integrated membrane strategy wherein a
high-permeability thin film is chemically functionalized with a patchy CO2-philic grafted chain surface
layer. A high-solubility mechanism enriches the concentration of CO2 in the surface layer hydrated
by water vapor naturally present in target gas streams, followed by fast CO2 transport through a highly
permeable (but low-selectivity) polymer substrate. Analytical methods confirm the existence of an
amine surface layer. Integrated multilayer membranes prepared in this way are not diffusion limited
and retain much of their high CO2 permeability, and their CO2 selectivity is concurrently increased in
some cases by more than ~150-fold.


An alternative to chemical absorption for
the capture of CO2 derives from polymer
gas-separation membranes (1-3) that can
be chemically tailored to achieve different
levels of CO2 selectivity and permeability.
Cross-linked elastomeric membranes
produced from polyethers are reverse selective,
which means that their selectivity is based
on solubility rather than size (i.e., diffusion)
considerations, and display a strong chemical
affinity for CO2 (4-7). Alternatively, glassy
polymers, such as polyimides and polysulfones,
operate on molecular-size sieving and
can be modified in various ways to alter the
free-volume pathways through which penetrant
molecules migrate (8-10). In addition
to these mature membrane technologies,
recent studies have demonstrated that humidified
block ionomers promote high CO2
permeabilities at moderate selectivities (11, 12),
whereas membranes containing mobile CO2-
philic carriers can yield high CO2 selectivities
at relatively low permeabilities (13). Although
these advances, as well as those regarding
other classes of polymer membranes, are
encouraging, they confirm the trade-off between
permeability and selectivity, which is
manifested as the empirical Robeson upper
bound (14).
In efforts to improve the gas flux through a
membrane, polymer membranes have evolved
from self-standing films measuring ~50 to
100 µm thick to supported thin films typically
one to two orders of magnitude thinner
(15, 16), as illustrated in Fig. 1A. The motivation
for doing so is that thin films retain the
polymer characteristics responsible for selectivity
and reduce the diffusive path length
(because macroporous supports afford a negligible
barrier to diffusion). A special case
of CO2 permeation is based on chemically
augmented facilitated transport, wherein
CO2 molecules react with specific chemical
groups in a hydrated membrane and consequently
permeate more quickly (17-21). Details
of these transport mechanisms are provided
in the supplementary materials (section 1). A
polymer that is technologically promising is
hydrated polyvinylamine (PVAm) (19, 22),
which has been reported (20) to exhibit CO2/N2
selectivities up to 500, depending on test conditions
and specimen preparation. Unfortunately,
CO2 permeability is diffusion hindered,
even through hydrated PVAm membranes
that are less than a micrometer thick. An important
design paradigm drawn from this
observation is that the ultrahigh CO2 selectivity
afforded by facilitated transport should
be preserved, whereas the extent to which diffusive
limitations negatively affect CO2 permeability
should be lessened.
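
The argument for thinner films can be made concrete with the textbook solution-diffusion relation, in which steady-state flux equals permeability times the pressure difference divided by film thickness. The Python sketch below is only an illustration of that scaling: the function name and all numerical values are hypothetical, the units are arbitrary but consistent, and it ignores support resistance and facilitated transport.

```python
def steady_state_flux(permeability, pressure_difference, thickness):
    """Flux J = P * dp / l for solution-diffusion transport through a dense film."""
    if thickness <= 0:
        raise ValueError("thickness must be positive")
    return permeability * pressure_difference / thickness

# Hypothetical permeability and driving force (arbitrary, consistent units).
P = 1.0e-12
dp = 1.0e5

# A 50-um self-standing film versus a supported film two orders of magnitude thinner.
thick_film_flux = steady_state_flux(P, dp, 50e-6)
thin_film_flux = steady_state_flux(P, dp, 0.5e-6)

flux_gain = thin_film_flux / thick_film_flux
print(f"flux gain from thinning the film: {flux_gain:.0f}x")
```

Because permeability is a material property, it cancels in the ratio, so the gain depends only on the thickness reduction under these idealized assumptions.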

One route to fabricate a polymer membrane
having both characteristics is included
in Fig. 1A. Here, a high-solubility and fast-
diffusion membrane is generated by growing
an ultrathin amine-containing surface
layer on a supported high-permeability polymer
thin film. These hybrid-integrated (HI)
membranes exploit the tunable surface functionalization
introduced through surface polymerization
(23, 24) so that their surface is
covered by a highly CO2-philic and hydrophilic
nanoscale polymer layer, which is conceptually
similar to surface-modified polymer
membranes produced for liquid separation (25).
Within this nanofabricated design (achieved
using the setup pictured in fig. S1; supplementary
materials, section 2), a mixture of
CO2, N2, and H2O gases initially encounters
a molecularly thin amine-rich layer (region I
in Fig. 1B), and the CO2 permeates through
this hydrated layer as a result of facilitated
transport without appreciable diffusive resistance.
In essence, this layer serves to concentrate
CO2 before it enters the underlying
substrate. An elevated population of CO2
molecules then enters and quickly permeates
through a dense polymer support (region II)
and then a macroporous support (region III),
yielding both high CO2/N2 selectivity and
high CO2 permeability. This scenario, depicted
in Fig. 1B, represents a combinatorial separation
mechanism that marries fast facilitated
transport (high CO2 selectivity) at the
membrane surface in region I with solution-
diffusion transport (high CO2 permeability)
through region II.
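
Selectivity for a mixed-gas feed is commonly reported as a separation factor, the CO2/N2 ratio in the permeate divided by the same ratio in the feed, while ideal selectivity is the ratio of the two permeabilities. The Python sketch below shows both definitions. The 10/90 CO2/N2 feed matches the feed composition the article reports using; the permeate composition and the function names are hypothetical and serve only to illustrate the arithmetic.

```python
def separation_factor(x_co2, x_n2, y_co2, y_n2):
    """Mixed-gas separation factor: (y_CO2 / y_N2) / (x_CO2 / x_N2).

    x_* are feed mole fractions and y_* are permeate mole fractions.
    """
    return (y_co2 / y_n2) / (x_co2 / x_n2)

def ideal_selectivity(permeability_co2, permeability_n2):
    """Ideal selectivity: ratio of the two gas permeabilities."""
    return permeability_co2 / permeability_n2

feed = (0.10, 0.90)      # 10/90 CO2/N2 feed, as stated in the article
permeate = (0.50, 0.50)  # hypothetical permeate composition

alpha = separation_factor(*feed, *permeate)
print(f"separation factor: {alpha:.1f}")
```

A membrane that leaves the gas ratio unchanged gives a separation factor of 1, so larger values indicate stronger enrichment of CO2 in the permeate.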

Two amorphous commercialized gas-
separation membranes having relatively high
CO2 permeability, elastomeric polydimethylsiloxane
(PDMS) and glassy polytetrafluoroethylene
(PTFE AF), were selected for surface
modification. The PDMS membranes were
procured as commercial membrane systems
(DeltaMem AG, Allschwil, Switzerland) composed
of PDMS thin films measuring 6 µm
thick and supported on porous polyacrylonitrile
mounts. The PTFE AF membranes were
produced in house by casting Teflon AF (amorphous
fluoropolymer) 2400 (Chemours, Geneva,
Switzerland) from 3M Fluorinert Electronic
Liquid FC-72 (Nordmann Nordic AB, Stockholm,
Sweden) into Teflon petri dishes measuring
10 cm in diameter. The resultant films were
first dried at 25°C for 24 hours and then
vacuum-dried at 200°C for at least 24 hours
to yield films measuring 25 to 50 µm thick.
To avoid edge effects, they were cut into circles
(8 cm diameter) and mechanically supported
on porous poly(vinylidene fluoride) mounts
(MFP2, with a pore size of 0.2 µm) manufactured
by Alfa Laval (Søborg, Denmark).

The series of reactions used to chemically
alter the polymer surfaces is portrayed in
Fig. 1C and begins with the introduction of
initiator sites by proton extraction under ultraviolet
(UV) conditions. For this purpose,
benzophenone is chemically attached to each
membrane surface to introduce reactive hydroxyl
moieties (26). The surface density of
initiator sites was controlled by removing
unbound initiator with methanol or acetonitrile
and varying the initiator concentration
(from 1 to 15% in acetonitrile), with the best
CO2 permeation results achieved at 10% benzophenone.
These sites are then reacted with
glycidyl methacrylate (GMA) monomer (3 to
15% in methanol or acetonitrile, with 10%
yielding the most promising results) and
polymerized under UV radiation to yield
poly(glycidyl methacrylate) (PGMA) chains on
the surface of each membrane. Epoxy ring
opening of the PGMA chains with 20 to 50%
ethylene diamine (EDA) in aqueous solution
yielded the desired amine functionality. Additional
details regarding the surface modification
are provided in the supplementary
materials (section 3). According to x-ray photoelectron
spectroscopy (XPS) spectra collected
on a Kratos Ultra spectrometer (Fig. 1D), both
surface-modified polymers exhibit evidence
of surface nitrogen, confirming the presence
of surface amine groups after the final reaction
step. Complementary spectroscopic results are
included in figs. S2 to S4.

Incorporation of a patchy ultrathin amine 
layer on the surface of PDMS, an elastomer, 
and PTFE AF, a glassy polymer, results in pro- 
found changes in CO, permeation and se- 
lectivity, as evidenced by the pressure- and 
temperature-dependent measurements com- 
piled in Fig. 2. Gas permeability tests were 
performed with a real-time automated data 
collection setup (fig. S5) wherein feed pressure, 
humidity, temperature, and He sweep are all 
precisely regulated. All the results reported 
here—measured from a mixed 10/90 CO2/N2
gas feed (at 100% relative humidity) that is 

Fig. 1. Conceptual and experimental details of the studied HI membranes. (A) Schematic diagram displaying
the evolution of gas-separation polymer membranes from self-standing to supported to integrated (labeled).
(B) Illustration of the underlying mechanism behind HI membranes. Here, a surface-grown amine-rich layer (region I)
is grown on the surface of a highly permeable polymer thin film (region II) supported by conventional macroporous
polymers (region III). Selectivity is exclusively achieved in region I, which requires hydration. (C) The chemical
route used to grow amine-modified chains on polymer substrates. See the text and the supplementary materials
(section 2) for details. (D) Representative XPS spectra acquired from PTFE AF (top) and PDMS (bottom) membranes
before (black lines) and after (blue lines) surface modification. The presence of the nitrogen (N 1s) peak associated
with the amine-rich surface layer is highlighted in each surface-modified spectrum.


1 APRIL 2022 * VOL 376 ISSUE 6588 91 


RESEARCH | REPORTS 


representative of a broad spectrum of CO2-related
separations and a standard feed pressure
(1.2 bar) and temperature (25° ± 1°C),
unless otherwise indicated—reflect the most
common conditions encountered for gases
containing CO2. The dependence of CO2 permeability
on pressure displayed in Fig. 2A
confirms that the parent PDMS and PTFE AF 
polymer thin films have comparable perme- 
abilities and remain largely unaffected by in- 
creasing pressure over the range examined, as 
expected for solution-diffusion permeation. By 
contrast, the CO2 permeabilities of the amine-functionalized
membranes (am-PDMS and am-PTFE AF)
are lower than those of their parent
membranes, assuming negligible resistance 
from the membrane supports, and they de- 
crease with increasing pressure, which is in- 
dicative of facilitated transport (17, 19, 20). 

Similarly, the mixed-gas CO2/N2 selectivities
of the unmodified and modified polymers de- 
crease slightly with increasing pressure in Fig. 
2C, whereas the am-PDMS and am-PTFE AF 
polymers exhibit highly elevated selectivity 
levels. The temperature-dependent permeabil- 
ities provided in Fig. 2B reveal that the parent 
polymers are either insensitive to or increase 
slightly with increasing temperature. Although 
analogous behavior is observed for am-PTFE 
AF, the am-PDMS membrane appears to be 
more highly temperature dependent. This re- 
sponse difference is attributed to the elasto- 
meric (flexible) nature of PDMS versus the 
more rigid nature of glassy PTFE AF, because 
the materials are expected to expand differently
upon heating, in accord with their coefficients
of thermal expansion: 9.6 × 10⁻⁴ °C⁻¹
(PDMS) and 3.0 × 10⁻⁴ °C⁻¹ (PTFE AF). The
corresponding CO2/N2 selectivities included
in Fig. 2D indicate that those of the parent 
polymers are comparably low and decrease 
slightly further with increasing temperature. 
By contrast, the selectivities of the surface- 
functionalized polymers are much higher but, 
in similar fashion to their CO2 permeabilities,
exhibit different thermal behavior. Additional 
measurement conditions are considered in 
figs. S6 to S8 in section 4. 

The improvements in CO2/N2 selectivity apparent
in Fig. 2 in the presence of moisture
are a consequence of the hydrated, short- 
chain patchy amine layers on the surfaces of 
PDMS and PTFE AF. Low-voltage scanning 
electron microscopy (SEM) images of mem- 
branes acquired on an ultrahigh-resolution 
FEI Verios 460L Schottky emitter electron
microscope at an accelerating voltage of 1.0 kV 
before and after surface functionalization 
are displayed in Fig. 3. In Fig. 3A, the surface 
of the unmodified PTFE AF thin film reveals 
the presence of surface pits (~80 to 450 nm 
in diameter), which, upon functionalization in 
Fig. 3B, become completely obscured. Because 
the surface of neat PDMS is featureless, Fig. 3, 

Fig. 2. Molecular transport through parent and amine-modified PDMS and PTFE AF membranes.
(A to D) CO2 permeability [(A) and (B)] and CO2/N2 selectivity values [(C) and (D)] measured as functions
of feed pressure [(A) and (C)] and temperature [(B) and (D)] from mixed gases for membranes derived from
PDMS (black symbols and lines) and PTFE AF (blue symbols and lines) before and after surface modification
[labeled in (A)]. Included in (B) and (D) are measurements from am-PDMS-2 membranes (indicated by
triangles). The solid lines serve to connect the data. In variable-pressure tests, the temperature is maintained
at 25° ± 1°C, whereas the pressure in variable-temperature tests is kept at 1.2 bar, and the feed stream is fully
humidified (100% relative humidity).

C and D, focuses on surface characteristics
that develop during surface polymerization. 
These features resemble dewetting patterns, 
which implies that one of the steps in the re- 
action sequence did not occur uniformly on 
the PDMS surface. However, closer examina- 
tion of the membranes before and after sur- 
face functionalization reveals a more complex 
picture. Height and amplitude atomic force 
microscopy (AFM) images collected on an Asy- 
lum MFP-3D probe microscope are displayed 
for membranes based on am-PTFE AF in Fig. 
3, E and F, and for membranes based on am- 
PDMS in Fig. 3, G and H. The insets corre- 
spond to the unmodified polymer films, and 
the height images yield root mean square 
(RMS) roughness values of 6.85 and 0.99 nm 
for virgin PTFE AF and PDMS, respectively. 
The modified surfaces consist of discrete 
nanodroplets, confirming the likelihood of sur- 
face dewetting during functionalization. The 
diameters of these features measure 42 to 
264 nm (127 nm average) on PTFE AF and 41 
to 179 nm (95.9 nm average) on PDMS, and 
their mean heights vary from 97.9 nm on PTFE 
AF to 12.3 nm on PDMS. These nanodroplets 
amplify the RMS surface roughness by ~10x 
(68.6 nm on am-PTFE AF and 8.93 nm on 
am-PDMS). Increases in the membrane sur- 
face area owing to the presence of topological 
features (relative to a flat surface) are 13.2 and 
3.6% for am-PTFE AF and am-PDMS, respec- 
tively, thereby providing greater opportunity 
for CO2 molecules to interact with and transport
through the amine-rich layer. A geometric
interpretation of this increase in surface area 
is provided in figs. S9 and S10. By changing the
intensity and exposure area of the surface
polymerization (supplementary materials, sec- 
tion 5), macroscopic dewetting can be elimi- 
nated on PDMS, yielding a second-generation 
HI membrane designated as am-PDMS-2. According
to SEM images of am-PDMS-2 membranes,
no large features are evident (fig. S11A),
whereas a nanoscale topology is discernible
(fig. S11B). Corresponding height and phase
AFM images are presented in Fig. 3, I and J
[as well as three-dimensional (3D) topographical
images in fig. S12], that confirm the existence
of a patchy nanoscale topology with a
substantially reduced RMS surface rough- 
ness (1.68 nm) on the am-PDMS-2 membranes. 
The circled features in Fig. 3, I and J, iden- 
tify protrusions measuring up to ~12 nm in 
height that are soft relative to the surround- 
ing matrix, indicating that they represent 
low-density, surface-grafted chain patches. 
Avoiding surface heterogeneities on the func- 
tionalized PDMS membranes not only in- 
creases uniformity but also improves both 
CO2 permeability (Fig. 2B) and CO2/N2 selectivity
(Fig. 2D).

The performance of these HI membranes 
relative to other polymer membranes intended 
for CO2 separation is routinely assessed on a
Robeson plot, which evaluates membrane se- 
lectivity as a function of permeability. This 
representation identifies the empirical upper 
bound that reflects the trade-off between se- 
lectivity and permeability. As membranes have 
continued to improve since the original upper 
bound was identified (27), the position of the 
upper bound has progressively shifted upward. 
The upper bound identified by Robeson (14) 

Fig. 3. Topological characteristics of parent and amine-modified PDMS and PTFE AF membranes.
(A and B) Low-voltage planar SEM images of solvent-cast PTFE AF membranes before (A) and after (B)
surface modification. Examples of pits measuring between ~80 and 450 nm in diameter are highlighted by circles
in (A), whereas a region that is reminiscent of ridges such as those evident in (A)
is identified by the square in (B). (C and D) SEM images reveal the surface features
that develop on first-generation PDMS membranes upon surface modification.
The inset in (D) is a 3x enlargement of one of the regions identified by arrows.
(E to H) Complementary AFM height [(E) and (G)] and amplitude [(F) and (H)]

is identified in Fig. 4A, which compares mem- 
branes that function according to the solution- 
diffusion mechanism. Included in this figure is 
a more recent (and shifted) upper bound proposed
by Comesaña-Gándara et al. (28) in 2019
on the basis of polymers of intrinsic microporosity
(PIMs). The PIM membranes afford
very high CO2 permeabilities at modest CO2/N2
selectivity levels (29, 30), whereas a variety
of other membranes also displayed in Fig. 4A 
range in permeability and selectivity from rel- 
atively low to relatively high at ~25° to 35°C. 
To provide an additional classification descrip- 
tion of these performance metrics, we divide 
the Robeson plot into quadrants and refer to 
membranes exhibiting CO2 permeabilities in
excess of 10³ Barrer [where 1 Barrer = 10⁻¹⁰
cm³STP cm cm⁻² s⁻¹ cmHg⁻¹ (cm³STP, standard cubic centimeter;
cmHg, centimeter of mercury)] as ultrapermeable
(uP) and membranes achieving CO2/N2
selectivities greater than 10² as ultraselective
(uS). Although many of the membranes
presented in Fig. 4A (including those prepared 
from unmodified PTFE AF and PDMS) are uP 
in the fourth quadrant, notably few are uS. In 
fact, two types of membranes evaluated for 
pilot-scale separation of CO2 from flue gas are
neither uP nor uS (in the third quadrant). This 
disparity suggests that membranes fabricated 
for CO2 capture focus primarily on improved
permeability (6, 28, 31).
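
The quadrant scheme described above can be written out as a short classifier. This is only an illustrative sketch: it assumes the thresholds read as 10³ Barrer for ultrapermeable (uP) and 10² for ultraselective (uS), the function name `robeson_quadrant` is hypothetical, and the sample values are invented solely to land in each quadrant rather than taken from Fig. 4.

```python
# Thresholds as read from the text (assumed: 10^3 Barrer and a CO2/N2 selectivity of 10^2).
UP_THRESHOLD_BARRER = 1e3  # CO2 permeability above this counts as ultrapermeable (uP)
US_THRESHOLD = 1e2         # CO2/N2 selectivity above this counts as ultraselective (uS)


def robeson_quadrant(co2_permeability_barrer, co2_n2_selectivity):
    """Return the quadrant (1 to 4) of a membrane on the divided Robeson plot.

    1: uP and uS, 2: uS only, 3: neither, 4: uP only.
    """
    is_up = co2_permeability_barrer > UP_THRESHOLD_BARRER
    is_us = co2_n2_selectivity > US_THRESHOLD
    if is_up and is_us:
        return 1
    if is_us:
        return 2
    if is_up:
        return 4
    return 3


# Hypothetical (permeability in Barrer, selectivity) pairs, one per quadrant.
hypothetical_membranes = {
    "uP and uS": (1200.0, 500.0),
    "uS only": (200.0, 500.0),
    "neither": (200.0, 30.0),
    "uP only": (3000.0, 10.0),
}

for label, (permeability, selectivity) in hypothetical_membranes.items():
    print(label, "-> quadrant", robeson_quadrant(permeability, selectivity))
```

Because both axes of a Robeson plot are logarithmic, the two thresholds appear as straight lines that split the plot into the four regions referred to in the text.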

Examples satisfying the criterion set forth 
above for uS membranes in the second quad- 
rant of Fig. 4A are cellulose nanofiber/ionic 
liquid (CNF/IL) membranes (which rely on hydrated
IL nanochannels) (13) as well as most of
the am-PDMS and am-PDMS-2 HI membranes 
investigated in this work and several materials 
reported in the literature. However, only a few 
membranes including a bioinspired thin film 
(32) and several membranes incorporating 
graphene oxide nanosheets (33-35) can be 
classified as both uP and uS in the first quad- 
rant of Fig. 4A. In marked contrast, all the 
am-PTFE AF membranes investigated here 
at temperatures from 25° to 55°C, as well as 


images corresponding to PTFE AF [(E) and (F)] and PDMS [(G) and (H)] membranes
after surface modification (the inset displayed at the same magnification in each
frame is acquired from the parent, unmodified surface). (I and J) The paired height
and phase AFM images of the surface of an am-PDMS-2 membrane in (I) and
(J), respectively, confirm the existence of patchy amine-rich regions. The color-matched
circles identify protrusions (light) in (I) that are mechanically soft (dark) in
(J). Values of the surface RMS roughness measured from AFM images, such as
these, are provided in the text, and 3D topographical images of the am-PDMS-2
membrane are included in fig. S12.


the am-PDMS-2 membrane evaluated at 55°C, 
are both uP and uS. Although surface amination
of the PTFE AF membranes reduces their
CO2 permeability by 50 to 60%, the corresponding
CO2/N2 selectivity increases by
~150x. These results confirm that the
patchy, low-density surface amine layer provides
a highly hydrophilic and CO2-philic environment
that absorbs and transports CO2
molecules by means of facilitated transport,
whereas the highly permeable substrate pro- 
vides contiguous pathways with little barrier 
to molecular diffusion. For added compari- 
son, the results displayed in Fig. 4B confirm 
that the transport metrics reported here for 
the HI membranes have higher CO2 selectivities
than most purely facilitated-transport
membranes, which primarily lie in the sec- 
ond quadrant (with two selected for pilot- 
scale evaluation). 

In Fig. 4A, the CO2/N2 selectivity achieved
at 1200 Barrer for the am-PTFE AF membrane 
is substantially higher than the Robeson (14) 

Fig. 4. Performance comparison of HI membranes relative to other polymer membranes developed
for CO2 capture. (A) Performance metrics of the HI membranes studied in this work (am-PDMS, open
black circles; am-PDMS-2, open black triangles; am-PTFE AF, open blue circles) relative to their
parent materials (PDMS, filled black circle; PTFE AF, filled blue circle) as well as other polymer membranes
proposed for CO2 capture on the basis of the solution-diffusion mechanism. According to the requirements
proposed here, uP membranes exhibiting CO2 permeabilities beyond 10³ Barrer (in the fourth quadrant)
include both PDMS and PTFE AF precursors in addition to several other promising polymers (open brown
diamonds) and those in the PIM (open red squares) category. Membranes subjected to pilot-scale evaluation
(filled brown diamonds) are identified in the third quadrant and listed in table S1. In the second quadrant,
uS membranes, such as those based on CNF/IL membranes (13) (filled green square), reach CO2/N2
selectivities of more than 10² at lower CO2 permeability levels. The Goldilocks (first) quadrant (shaded yellow)
combines both uP and uS performance. The solid lines identify the permeability-selectivity trade-off proposed
by Robeson (14) for various polymer membranes (gray) and more recently by Comesaña-Gándara et al. (28)
for PIM membranes (red). (B) Performance metrics of the HI membranes relative to hydrated facilitated-transport
membranes (open purple inverted triangles) in the first and second quadrants only. Promising
membranes assessed in pilot-scale testing (closed purple inverted triangles) are circled and compiled
in table S1. The color-coded arrows in (A) and (B) indicate the effect of increasing temperature (ΔT) on the
HI membranes derived from PDMS (black) and PTFE AF (blue).


or PIM-based (28) upper-bound thresholds (by 
~30x and ~20x, respectively). Equally impor- 
tant is our observation that the enhanced sep- 
aration performance of am-PTFE AF at different 
temperatures in Fig. 4A varies less compared 
with that of the am-PDMS and am-PDMS-2 
membranes. The CO2 transport characteristics
of the am-PTFE AF and am-PDMS-2 membranes
converge as the amine surface layers
exhibit fewer heterogeneities during polymerization
(supplementary materials, section 5).
Differences in separation performance accom- 
panying variations in surface modification are 
considered in figs. S13 to S15, and the results 
reported here for CO2/N2 gas mixtures are
extended to CO2/CH4 mixtures in fig. S16, which
reveals that the am-PTFE AF membranes retain
their uP and uS performance with regard
to CO2 in the case of two dissimilar gas mixtures.
Additionally, aspects of durability and
adaptability of the am-PDMS-2 and am-PTFE 
AF membranes are examined in figs. S17 and 
S18, respectively, and results collected from 
single- and mixed-gas tests are quantitatively 
compared in fig. S19 to further elucidate the 
molecular-level mechanism by which CO2 permeates
through the HI membranes studied in
this work. 

Although the results reported herein are 
highly encouraging, three other relevant issues 
should be recognized. First, the membranes 
developed in this work derive from polymers 
that are already used for commercial gas separations,
in which case they are widely available
and the processes required to manufacture and 
upscale gas-separation modules for industrial 
operations would accommodate the surface- 
functionalized analogs. Second, because the 
PTFE AF and PDMS membranes are already 
in commercial use, the added expense of sur- 
face functionalization is most likely much less 
than introducing a new designer membrane. 
Third, the methodology described in this study 
is not restricted to PTFE AF and PDMS with 
a patchy, amine-rich surface layer. We have 
selected the systems discussed here as exem- 
plars to provide proof of concept. By marry- 
ing the facilitated transport of a molecularly 
thin amine surface layer with a high-permeability 
polymer film, we have successfully demon- 
strated that HI membranes can be nanofabri- 
cated to take advantage of this high-solubility, 
quick-diffusion combination and produce next- 
generation membranes that, instead of being 
limited by a permeability-selectivity trade-off, 
are both uP and uS. 

COGNITIVE SCIENCE


Complex cognitive algorithms preserved by selective 
social learning in experimental populations 


B. Thompson, B. van Opheusden, T. Sumers, T. L. Griffiths


Many human abilities rely on cognitive algorithms discovered by previous generations. Cultural 
accumulation of innovative algorithms is hard to explain because complex concepts are difficult to 
pass on. We found that selective social learning preserved rare discoveries of exceptional algorithms 
in a large experimental simulation of cultural evolution. Participants (N = 3450) faced a difficult 
sequential decision problem (sorting an unknown sequence of numbers) and transmitted solutions 
across 12 generations in 20 populations. Several known sorting algorithms were discovered. Complex 
algorithms persisted when participants could choose who to learn from but frequently became extinct in 
populations lacking this selection process, converging on highly transmissible lower-performance 
algorithms. These results provide experimental evidence for hypothesized links between sociality and
cognitive function in humans.


Reading, counting, cooking, and sailing
are just some of the human abilities 
passed from generation to generation 
through social learning (1, 2). Complex 
abilities like these often depend on 
learned cognitive algorithms: procedural rep- 
resentations of a problem that coordinate mem- 
ory, attention, and perception into sequences 
of useful computations and actions. Accumu- 
lation of complex algorithms—from ancient 
tool-making techniques to bread making, boat 
building, or horticulture—is central to human 
adaptation (3-5) yet challenging to explain 
because algorithmic concepts can be difficult 
to discover, communicate, and learn from ob- 
servation (6), making them vulnerable to loss 
(7-10). Theories of cultural evolution suggest 
that human social learning may help overcome 
this fragility (11, 12). For example, mathematical 
models (13) predict that choosing to learn from 
successful or prestigious individuals can pre- 
vent the loss of rare innovations. However, this 
potential link between sociality and complex 
abilities (14) is challenging to establish.

We conducted large-scale simulations of 
cultural evolution with human participants 
to assess how selective social learning in- 
fluenced the evolution of cognitive algorithms. 
Prior research (15-22) shows that social learn- 
ing can improve decisions in multiple-choice 
tasks (23), perceptual judgments (24), and 
search problems (25) and can improve arti- 
facts such as physical structures (26) or com- 
puter programs (27). However, the evolution of 
cognitive algorithms at the population level 
has been difficult to study (28). We developed 
custom software to recruit large numbers of 
participants online and organize them into 
evolving societies facing a common problem.
Twenty populations tackled a sequential de- 
cision problem (Fig. 1A and materials and 
methods 1.1). Presented with six images, par- 
ticipants attempted to establish hidden arbi- 
trary orderings using pairwise comparisons. 
Out-of-order pairs swapped positions when 
compared. Participants were rewarded for 
establishing the ordering using fewer com- 
parisons. This task poses a sorting problem, 
requiring a strategy for executing appropriate 
sequences of actions, analogous to culturally 
evolved strategies for making tools or food. 
Participants transmitted their strategies 
across 12 generations (Fig. 1B). Initial asocial 
generations completed the task individually. 
Participants in generations 2 to 12 were ran- 
domly assigned to either the experimental 
selective social learning (SSL) treatment (par- 
ticipants could choose demonstrators from the 
previous generation based on demonstrator 
performance) or to a control random mixing 
(RM) treatment (demonstrator performance 
was not visible). Participants were incentivized 
to improve on strategies that they observed
and provide a helpful description and demon- 
stration of their own strategy (materials and 
methods 1.1.13). 

Figure 2 shows task performance. Asocial 
participants achieved the lowest scores [mean 
over scored trials = 0.33, SD = 0.19, relative to 
a theoretical maximum of 1; supplementary 
materials (SM) 3; mean number of success- 
ful trials = 6.45 of 10 scored trials, SD = 2.89; 
SM 2.1.1]. Asocial participants improved over 
trials, confirming within-task learning {regression
coefficient posterior mean (β) = 0.14, 97%
highest density interval (HDI) = [0.11, 0.16];
SM 2.1.1}. Participants who learned socially
(generations 2 to 12) achieved higher performance
[mean (M) = 0.446, SD = 0.199] than
asocial participants (β = 0.114, 97% HDI =
[0.083, 0.145]; SM 2.1.2). RM participants (M =
0.399, SD = 0.167) outperformed asocial participants
(β = 0.067, 97% HDI = [0.036, 0.098];
SM 2.1.3). SSL participants performed highest
(M = 0.493, SD = 0.216), outperforming asocial
(β = 0.162, 97% HDI = [0.13, 0.193]) and RM
(β = 0.095, 97% HDI = [0.095, 0.107]; SM 2.1.3)
participants.

By the final generations (9 to 12), perform- 
ance among SSL (M = 0.514, SD = 0.216) and 
RM (M = 0.433, SD = 0.163) participants in- 
creased substantially relative to asocial par- 
ticipants (SSL: β = 0.181, 97% HDI = [0.147,
0.221]; RM: β = 0.101, 97% HDI = [0.071, 0.132];
SM 2.1.5 and 2.1.6). This was not just a con- 
sequence of accumulating more knowledge 
of the task: In a follow-up experiment, asocial 
participants completed 52 trials (equivalent to 
four generations) yet performed worse than 
SSL participants (but better than RM partic- 
ipants; SM 1.3). In the SSL group, but not the 
RM group, a distinct population of high per- 
formers emerged (performance > 0.6; Fig. 2B). 
Most of these participants (85.2%) learned 
from a high-performing demonstrator (Fig. 
2C). However, most people who learned from 

Fig. 1. Problem-solving task and study design. (A) Sorting task. (B) Participants chose up to three
demonstrators from a sample of eight (out of 15) members of the preceding generation. Ten SSL networks
were paralleled by 10 RM networks yoked to shared initial generations.

Fig. 2. Performance in the sorting task. (A) Performance over generations. Error bars show 95% confidence intervals. A, asocial. (B) Performance among SSL
(top), RM (middle), and asocial (bottom) participants. The vertical axis shows the number of participants. The vertical dashed line indicates the high-performance
threshold (0.6). (C) Participant performance relative to their demonstrators. t, time (generation number).


a high-performing demonstrator were not 
themselves high performing (64.9%). This 
asymmetry indicated an exceptional strategy 
that is rarely discovered and difficult to pass on. 

To deepen our understanding of these strat- 
egies, we analyzed the algorithmic structure 
of participant responses. Using recurrent 
neural network models (SM 4), we identified 
eight classes of algorithms (SM 4.3). Partici- 
pants discovered several known sorting algo- 
rithms. Two algorithms were more frequent 
than all others. The first algorithm is known in 
computer science as selection sort. Selection 
sort generates 15 specific comparisons (1-2, 1-3, 
1-4, 1-5, 1-6, 2-3, 2-4, 2-5, 2-6, 3-4, 3-5, 3-6, 4-5,
4-6, and 5-6; Fig. 3A) that are guaranteed to
succeed (performance = 0.55). The second algo- 
rithm is an even more efficient solution, known 
as gnome sort. The comparisons implied by 
gnome sort depend on the outcome of earlier 
comparisons, so its behavior varies. Participants 
using gnome sort made specific sweeps of adja- 
cent comparisons until a pair failed to swap, re- 
membering where each sweep started (Fig. 3B). 
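
The two strategies can be sketched in code under the task's single allowed action: comparing two positions, with out-of-order pairs swapping. This is an illustrative reconstruction, not the study's software. The helper names are hypothetical, and the gnome sort variant follows the description given by participant 1281 (move each picture leftward until it stops moving), which is an assumption about how participants typically executed it.

```python
from itertools import permutations


def compare_and_swap(items, i, j):
    """Compare positions i < j; swap them if out of order. Return True if a swap occurred."""
    if items[i] > items[j]:
        items[i], items[j] = items[j], items[i]
        return True
    return False


def selection_sort(items):
    """Fixed comparison sequence 1-2, 1-3, ..., 5-6 (1-indexed), independent of outcomes."""
    items = list(items)
    comparisons = []
    for i in range(len(items) - 1):
        for j in range(i + 1, len(items)):
            compare_and_swap(items, i, j)
            comparisons.append((i + 1, j + 1))
    return items, comparisons


def gnome_sort(items):
    """Adjacent comparisons sweeping leftward; each sweep ends when a pair fails to swap."""
    items = list(items)
    comparisons = []
    for start in range(1, len(items)):  # remember where each sweep started
        pos = start
        while pos > 0:
            comparisons.append((pos, pos + 1))  # 1-indexed positions of the adjacent pair
            if not compare_and_swap(items, pos - 1, pos):
                break  # the picture stopped moving; begin the next sweep
            pos -= 1
    return items, comparisons


# Count comparisons over every possible hidden ordering of six items.
gnome_counts = [len(gnome_sort(p)[1]) for p in permutations(range(6))]
print("selection sort comparisons:", len(selection_sort(range(6))[1]))
print("gnome sort comparisons (min, max):", min(gnome_counts), max(gnome_counts))
```

In this sketch selection sort always issues the same 15 comparisons, whereas the gnome sort count depends on the hidden ordering, which is the conditional behavior the text identifies as both more efficient and harder to pass on.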

Overall, 80% of participants who used an 
identifiable algorithm used selection sort or 
gnome sort (SM 4.2.1). Figure 4 shows major 
independent lineages of algorithms in four 
networks (all networks are shown in the SM). 
Only 44% of asocial participants (66 of 150) 
used an identifiable algorithm: Among asocial 
participants, 16% (26) used selection sort and 
just 13% (20) used gnome sort. All other algo- 
rithms were even rarer. In early generations, 
selection sort began to spread in both the RM 
and SSL groups (SM 2.2.2), independently 
becoming the most frequent algorithm in 7 of 
10 network pairs. In the RM group, selection 
sort continued to spread, dominating 9 of 




Fig. 3. Selection sort and gnome sort. (A and B) Strategy descriptions and executions by participants
using selection sort [(A); participant 1546] and gnome sort [(B); participant 1281]. Squares show swap (solid)
or no swap (transparent). G, generation.

(A) Selection sort. Participant 1546 (G10, RM): "I started with the first picture, then clicked on
picture two. Then back to one, then picture three. Then back to one, to picture 4 and so on. After I was
done with the first picture, I started with the second and went second picture to three, second to
four, and so on. I did this till I was at the end."

(B) Gnome sort. Participant 1281 (G8, SSL): "Begin with comparing the second picture to the picture
on the left. Once it stops moving, move on to the third picture, and compare it with the pictures to the
left until it stops moving. Then, move on to the fourth picture and compare it with the pictures to the
left. Then, move on to the fifth picture and compare it with the pictures to the left. Finally, choose
the sixth picture and compare it with the pictures to the left. Once the sixth picture stops moving,
the pictures should be in numerical order."


10 populations by generations 9 to 12 (67% of 
identifiable algorithms). However, in the SSL 
group, selection sort began to decline after 
generation 3 (Fig. 4). In the SSL group, but not 
the RM group (SM 2.2.3), gnome sort dominated
8 of 10 populations. By the final two generations,
more than half of all SSL participants
used gnome sort. 

Participants successfully transmitted se- 
lection sort and gnome sort from person to 

Fig. 4. Algorithm lineages. (Top) Major lineages (connected subcomponents exceeding 20 participants) in two network pairs. (Bottom) Algorithm frequencies under
RM (left) and SSL (right).


person. However, transmission fidelity differed 
(SM 2.3). Selection sort uses a fixed compar- 
ison sequence, so it is easier to describe and 
learn. Several algorithms (e.g., comb sort, 
bubble sort, insertion sort; SM 4.3) share this 
property but were rare, suggesting that selec- 
tion sort is particularly intuitive or memorable 
[consistent with sequence-learning studies (29) 
and mathematical analyses of attractors (30)]. 
Gnome sort was harder to pass on. Its condi- 
tional logic improved performance (sometimes 
needing only five comparisons) but made the 
algorithm harder to describe, learn, and use 
(SM 7). A follow-up study (SM 1.4) showed 
that transmission of complex algorithms in 
this task was primarily facilitated by participant
demonstrations, not written descriptions
(though descriptions do improve transmis- 
sion; SM 2.1.8). Finally, we found that selective 
social learning counteracted differences in 
algorithm transmission fidelity. SSL partici- 
pants chose higher-performing demonstra- 
tors and therefore encountered gnome sort 
more often (SM 2.4), offsetting the complexity 
disadvantage at the population level. Numer- 
ical simulations support this dynamic (SM 8). 
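
The population-level offsetting described above can be illustrated with a deliberately simple deterministic recursion. This is a toy sketch and not the simulation reported in SM 8. The starting share (0.13) and the fidelity (0.35) are loosely motivated by figures quoted in the text, and the selection bias of 8 is an arbitrary assumption. All names are hypothetical, and the model ignores independent rediscovery.

```python
def next_complex_share(share, fidelity, bias):
    """Advance the toy model by one generation.

    share:    fraction of demonstrators using the complex algorithm
    fidelity: probability that a learner who observes it also adopts it (assumed)
    bias:     relative weight of choosing such a demonstrator (1.0 means random mixing)
    """
    prob_observed = bias * share / (bias * share + (1.0 - share))
    return fidelity * prob_observed


def simulate(initial_share, fidelity, bias, generations):
    """Return the share of the complex algorithm in each generation, starting with the first."""
    shares = [initial_share]
    for _ in range(generations):
        shares.append(next_complex_share(shares[-1], fidelity, bias))
    return shares


INITIAL_SHARE = 0.13  # assumed from the asocial gnome sort frequency quoted in the text
FIDELITY = 0.35       # assumed; most learners failed to acquire the strategy
SSL_BIAS = 8.0        # arbitrary assumption about how strongly learners favor top performers

rm_shares = simulate(INITIAL_SHARE, FIDELITY, bias=1.0, generations=11)
ssl_shares = simulate(INITIAL_SHARE, FIDELITY, bias=SSL_BIAS, generations=11)
print("random mixing, final share:", round(rm_shares[-1], 6))
print("selective learning, final share:", round(ssl_shares[-1], 3))
```

Under these assumptions the complex algorithm dies out with random mixing because low fidelity compounds every generation, whereas a sufficiently strong preference for high performers (roughly, bias times fidelity above 1) lets it persist. The specific numbers carry no empirical weight.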

Our study offers insight into one potential 
mechanism linking sociality and the trans- 
mission of complex discoveries, supporting 
assumptions of mathematical models (31).
Random mixing led to an evolutionary process 
shaped by how difficult the algorithms were
to learn and convey, favoring highly trans-
missible algorithms (e.g., selection sort). By 
contrast, selective social learning led to an 
evolutionary process heavily influenced by 
how efficiently participants could solve the 
underlying problem, favoring more complex 
algorithms (e.g., gnome sort). 

Sequential decision-making problems arise 
in many higher cognitive functions such as 
planning, social interaction, navigation, and 
tool use. Like the sorting problem, these chal- 
lenges call for procedural strategies that we 
can follow to reliably achieve a goal when in- 
teracting with a dynamic task. Our study 
showed that a simple form of selective social 
learning helped populations establish such 
solutions by increasing the take-up of rare
but innovative algorithms. Alternative routes 
to increased uptake, such as forms of content- 
biased learning (13), may have analogous 
evolutionary consequences. 

Our study does not address potential con- 
sequences of population size, demographic 
profiles, or more complex forms of demon- 
strator selection that may arise in richer con- 
texts (e.g., strategies that account for potential 
relationships between expertise and teaching 
skills). Our findings highlight interactions be- 
tween reconstructive and preservative aspects 
of cultural transmission (30). Choosing who to 
learn from helps complex ideas be preserved 
and built on by future generations in evolving 
populations. 

CARDIOVASCULAR RESEARCH FACULTY — OPEN RANK


The Department of Pathology and Cell Biology at the University of South Florida, Morsani College of Medicine is seeking applicants for a full-time, tenured 
or tenure track, Open Rank faculty position in a broad area of cardiovascular research. We are particularly interested in researchers utilizing animal models 
to study cardiovascular development and disease. This position will be academically appointed to the Department of Pathology and Cell Biology and will 
include opportunities to interact with researchers both on the main USF campus and the USF Health Heart Institute. Opened in 2020 in downtown Tampa, the 
USF Health Heart Institute is a modern research facility within four floors totaling 100,000 square feet dedicated to cardiovascular and related research, including 
clinical trials targeting new ways to manage, prevent, and cure cardiovascular disease. 
committees and councils.


REQUIREMENTS: Applicants are required to have a Doctoral degree from an accredited institution or the highest degree appropriate in the field of 
specialization. Evidence of scholarly activity, sustained independent research funding, and professional research service are also preferred as is experience 
utilizing animal models. 


Postdoctoral research experience and at least one externally funded award (NIH R01 or equivalent; K99/R00 awards are acceptable for applicants at the Assistant
Professor level) are required. Must meet university criteria for appointment to the rank of Assistant, Associate, or Full Professor. For Associate/Full Professor,
evidence of creative work, professional writing or research in refereed and other professional journals, and recognized authority in the field of specialization is
required.


HOW TO APPLY: Applications will only be accepted through the USF Careers System. To apply, please visit: https://gems.usf.edu/ Job Posting ID 29861. In 
addition to completing the required online application, please include a cover letter, a description of research interests (limit to 3 pages or less), current curriculum
vitae, and contact information for three references. These documents must be consolidated into one attachment.


WORKING AT USF: With more than 16,000 employees in the USF system, the University of South Florida is one of the largest employers in the Tampa Bay 
region. At USF you will find opportunities to excel in a rich academic environment that fosters the development and advancement of our employees. We believe 
in creating a talented, engaged and driven workforce through on-going development and career opportunities. We also offer a first class benefit package that 
includes medical, dental and life insurance plans, retirement plan options, tuition program and generous leave programs and more. 


The Tampa metropolitan area is rapidly expanding and provides a culturally diverse environment that offers plenty of sunshine, world famous beaches, arts and 
entertainment, and an annual average temperature of 72 degrees. 
The Department of Pharmacology at Tulane University School of Medicine invites applications for a full-time, 12-month position as Professor of the Practice
in Pharmacology on the non-tenure track. The successful candidate will serve as a Principal Teacher and Coordinator of Medical and Graduate Teaching. This 
position will begin no later than August 1, 2022. 


The Department of Pharmacology has a great history of excellence in research and education in biomedical and medical pharmacology and is committed 
to providing high-quality programming at the graduate level. The Department of Pharmacology has responsibility for teaching courses in the medical and graduate 
degree programs focusing on Pharmacology. Current instructional methodologies used in teaching these courses include lectures, small group sessions, 
team-based learning (TBL), flipped classrooms, and simulation. 


The appointed candidate will serve as a Principal Teacher and Coordinator of Medical and Graduate teaching and is expected to participate in the following: 
curriculum development and implementation, graduate and undergraduate academic advising and mentoring, departmental and school committees, and scholarly 
activities. 


The requirements for this position include having a Ph.D., M.D., D.M.D., or D.O. degree, or equivalent. Additionally, prior experience in teaching
Pharmacology in a health professional program is required. Preference will be given to candidates who are outstanding teachers. 


Applicants are expected to submit a full application that includes: a cover letter; a curriculum vitae; a statement of teaching experience and teaching philosophy; 
and previous teaching evaluations or other evidence of teaching excellence, a summary of research/scholarly accomplishments, interests, and future plans. Also 
required is the contact information for 3-5 professional references including e-mail addresses. Please apply electronically in Interfolio at http://apply.interfolio.com/103640.
Informal inquiries and further information about the position may be directed to the Department Chair (Dr. David Busija, dbusija@tulane.edu)
or Director of Graduate Studies (Dr. Craig Clarkson, cclarks@tulane.edu). Review of applications will begin immediately and will continue until the position 
is filled. 


Employment Equity Statement: Tulane University is an Affirmative Action and Equal Opportunity Employer committed to excellence through diversity.
Tulane University is strongly committed to employment equity within its community and to recruiting diverse faculty and staff. The university encourages 
applications from all qualified candidates including women, under-represented minorities, such as BIPOC (Black, Indigenous, and People of Color), 
LGBTQ, persons with disabilities and protected veterans. 
WORKING LIFE


By Kyra H. Kim 




Bucking tradition 


I was born in 1990, the year of the white horse. In the lunar zodiac system, the white horse reigns
once every 60 years and is believed to be much stronger and faster than all other stallions. For 
boys born that year, it is considered a good sign. Not so for girls. A girl born under this zodiac 
is believed to have a wild, steedlike spirit that will cause her to buck away her fortune, destin- 
ing her family to trouble. These beliefs were indoctrinated into me as a girl growing up in South 
Korea. I was repeatedly told to fight against my steedlike spirit by striving to be still, gentle, 
and quiet. It’s an attitude that nearly prevented me from pursuing a career in the geosciences. 


When I was young, I loved ex- 
ploring the mountainous trails 
behind our house, examining 
the various soil layers in nearby 
ditches, and generally running 
wild outdoors. I often returned 
home with ripped tights and 
soiled clothes. “Maybe it’s the 
white horse in you,” my mother 
would say resignedly. 

My family elders didn’t see 
that side of me as much, but I 
frequently heard about the white 
horse from them as well. They 
boasted how “open minded” our 
family was for defying the omen 
and not terminating the preg- 
nancy that led to my birth. “Many 
girls did not make it to birth that 
year,” they’d say. “But we don’t believe
in that myth; our women are
disciplined and gentle.” 

I came to believe that my curious 
nature and desire to explore were not positive attributes—that 
they needed to be tamed. It didn’t help that Korean culture 
frowned on women pursuing strenuous activities; anything 
that involved heavy lifting, running, or even sweating was 
best left to men, traditional views held. I lived in unknow- 
ing submission to these ideologies for a long time, struggling 
with my desire to roam, in body and spirit. 

In high school, I settled on a quiet career choice: I 
would become a lawyer specializing in human rights and 
environmental issues. My elders praised me for a deci- 
sion they imagined would lead me to a respectable desk 
job. However, when I was accepted into a prelaw program 
and saw the curriculum, I felt a pang of disappointment: 
The courses struck me as mundane. That’s when a mentor 
nudged me in a different direction. “You could always get 
a science degree,” she said, “and go back into law.” 

I took her advice and landed in the United States to 
study geology. Shortly thereafter, though, culture shock
set in. I was not ready for the
intense physical requirements 
of my field courses. One 6-week 
summer course required strenu- 
ous hikes, camping in extreme 
heat, and heavy lifting. My cul- 
tural upbringing had discour- 
aged such “wild” activities and 
I had never gone on extended 
hikes or camped before. 

I blundered a lot in the begin- 
ning. One night, I nearly set my 
tent on fire. But over time my 
abilities grew, along with my con- 
fidence. The other women in my 
program were an inspiration. I saw 
beauty in their strength and fitness, 
and I wanted to emulate them. It 
was freeing to know I had found 
an environment where I could par- 
ticipate in physically demanding 
activities and not feel ashamed of 
it—but rather, feel empowered. 

By the end of my undergraduate degree, I had fully 
embraced my love of being a rugged, hand-chapped, rock 
hammer-wielding girl. And I’d given up on the idea of 
becoming a lawyer. My parents didn’t quite know what 
to make of my adventurous field life. But they were sup- 
portive when I told them about the change in my career 
direction. “So, a Ph.D., right?” they asked, trying to reas- 
sure themselves that my activities were leading me down 
an academic path. “Yes, that’s the plan,” I responded. 

I’m now a postdoc with years of experience collecting 
samples in challenging field environments, and I’m thank- 
ful I bucked against the cultural expectations placed on 
me. My wild, steedlike spirit wasn’t something to suppress. 
Instead, it led me to a career that’s a perfect fit for me. 


Kyra H. Kim is a postdoctoral researcher at NASA's Jet Propulsion 
Laboratory at the California Institute of Technology. Do you have an 
interesting career story to share? Send it to SciCareerEditor@aaas.org. 


1 APRIL 2022 * VOL 376 ISSUE 6588 


science.org SCIENCE 


ILLUSTRATION: ROBERT NEUBECKER 


The next breakthrough in Making Cancer History® 


Unlocking the full potential 


of immunotherapy 




MD Anderson Cancer Center proudly announces the launch 
of our global immunotherapy research and innovation hub, 
the James P. Allison Institute. 


Named for our Nobel laureate and world-renowned researcher Dr. Jim Allison, 
Immunotherapy is a pioneering treatment that unleashes
the patient’s own immune system to attack and eliminate cancer. 
The Allison Institute will bridge all research areas, 
forging new breakthroughs that continue to transform cancer care. 
the world can begin to realize an end to cancer, which has been
MD Anderson’s singular mission for more than 80 years. 